Introduction


Curiosity killed the cat. 

We like to think that it is human nature to be curious. Every day of our lives,
we ask ourselves questions. We wonder why a lesson that worked well last term 
doesn't work this time. We wonder what makes birds sing, why some people 
speak almost in a monotone, what signals people use to show that some things 
they want to say are new and important. We wonder about almost everything. 
It is curiosity that drives research. 

If all of us are curious, does that make us all researchers? In one sense it does. 
We all wonder, we all ask questions, and we all search for answers to our 
questions. The ways we search for answers, however, will differ. The way we 
search differentiates simple curiosity from research. Research is the organized, 
systematic search for answers to the questions we ask. 

Artist and scholar Elisabeth Mann Borgese was once asked how the world viewed 
her interest in teaching dogs and chimpanzees to type. Did the world of science 
appreciate her comments that the messages typed by her setter Arlecchino formed 
poems that rivaled, indeed surpassed, those of many modern poets? The world, 
responded the artist, has two answers to everything--either "I don't believe it" or
"Oh, I knew that already." To these might be added a third response from the
world of science--"Why should we care?" Let's tackle each of these comments in
turn. 

Research should be carefully designed to meet the first of these, the "I don't believe it"
response. What is it that convinces us that answers are correct? This
question requires that we think about what led us to particular answers and what
evidence supports those particular answers. It requires us to consider the chance
that we are deluding ourselves--that the lesson really worked just as well or better
for most students but didn't work at all for a few. In research, what convinces
us and what makes us doubt answers must be clearly articulated if we wish to
overcome the first response of Borgese's world--"I don't believe it!"

Each of us as individuals has different criteria for believing or not believing the 
truth of claims. Different academic fields also have different criteria for accepting
or rejecting the truth of proposed answers to questions. The first of these has
to do with where we turn for answers. In many fields the first place to search for 
answers is to ask experts, colleagues, ourselves, and others. The "asking" may be 
quite informal or it may involve exceedingly complex questionnaire research. 
Such a research methodology is an excellent way to find answers to certain types 
of questions. And, of course, it is not a good way to search for answers to other
kinds of questions. 

In some fields another place to search for answers is to observe natural life situations
using an ethnographic approach or an approach where the researcher is a
participant-observer. The data collected may simply be a set of organized notes 
or it may involve elaborately coded videotaped records. The observations may 
cover only one instance or may run for months or even years. Another place to 
look for answers may be existing texts or transcripts. The observations may cover 
only a few records or pages or may include transcripts of thousands of lines. The 
questions and the data used will determine just how complex the analysis will be. 

In some fields, more confidence is placed in answers where researchers have been 
able to manipulate or control the many factors that might influence or affect 
outcomes. When researchers have a good idea of the possible answers there 
might be to questions, they can manipulate variables in different ways to make 
certain that the variables truly act in the way they think they do. Experimental 
research allows researchers that option. 

In library research all these methods may be combined. By surveying previous 
research, we can discover not only where but how experts have searched for answers
to particular questions. But, to evaluate their answers, it is necessary to
understand the methods each researcher used. This might include an understanding
of questionnaire, experimental, ethnographic, participant-observer, and
case study methods. In addition, we need to understand how the information 
given in these reports can be evaluated and combined as evidence in support (or 
nonsupport) of other candidate answers to questions. 

Since we believe that research is primarily a way of convincing ourselves and
others of answers to questions, it is important to know where to search for answers
and what counts as evidence. It is also important to know just what standards
any particular field may use as to what counts as valid evidence. In some
fields, it is considered appropriate to ask ourselves questions, look within ourselves
for answers, and, when asked why we believe our answers are correct, give
examples as evidence in support of assertions. For example, in linguistics, it has
been common practice to pose a question about acceptability of particular language
forms and then answer it by "consulting native speaker intuition." If we
think the sentence "He suggested me to take advanced statistics" is unacceptable,
that is evidence enough. The use of example sentences (even those created by the
researcher rather than actually used by someone) is also considered acceptable
evidence in support of answers to research questions. Some fields consider "typical
examples" as evidence whether or not "typical" is precisely defined.

In other fields, self-consultation and even typical examples fail to convince. The 
world that says "I don't believe it!" wants to know how many people think X (e.g., 
how many people consider "He suggested me to..." acceptable), how often X occurs
(e.g., how often ESL learners produce such errors), or how much X was
needed to influence someone to do Y or think Z (e.g., how many lessons would
be needed to get students to use the structure correctly). Summary tables often
appear in journals, giving this type of information in support of answers to
questions. As with the use of examples, tables are very helpful in summarizing
information. However, tables--like examples--must be carefully interpreted before
we trust them as evidence.
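
As a rough illustration of the kind of summary table described here, the following Python sketch tallies acceptability judgments for a single sentence. The responses are invented purely for illustration, and the function name summarize_judgments is our own; neither comes from any actual study.

```python
from collections import Counter


def summarize_judgments(responses):
    """Return (label, count, percent) rows, most frequent label first."""
    total = len(responses)
    if total == 0:
        return []
    counts = Counter(responses)
    rows = []
    for label, count in counts.most_common():
        rows.append((label, count, round(100 * count / total, 1)))
    return rows


# Hypothetical judgments of "He suggested me to take advanced statistics".
# These values are made up for the example; they are not real data.
responses = ["unacceptable"] * 7 + ["acceptable"] * 2 + ["unsure"] * 1

for label, count, percent in summarize_judgments(responses):
    print(f"{label:<14}{count:>3}{percent:>7}%")
```

Even a table this simple needs careful reading: ten respondents is a very small group, and the tally alone says nothing about who they were or how the question was put to them.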

Many fields expect that the researcher will not only give examples, frequency tables,
or tables showing typical "averages," but that these will be annotated to
show exactly how much confidence the reader can reasonably have that claims
based on such findings are correct. Usually, such fields also place constraints on
the acceptability of research designs, data-gathering procedures, and methods of
analysis.

As individual researchers, none of us wants to make claims which are unwarranted.
We want our findings to be both new and well substantiated so that we
can face a world that already knows it all or doesn't believe us anyway. The way
we search for answers to our questions will determine our own confidence in
sharing our findings.

We also want to have confidence in answers given by others. We may be content 
with answers accepted by the field (we accept the answers and thus "know that 
already"). On the other hand, we may not be content without examining the evidence
very carefully. Trust is a good thing to have but, when the cost of accepting
or rejecting answers based on evidence is high, trust will not suffice.
Careful examination of research reports requires not only common sense but also 
some degree of statistical literacy. 

Research, then, is a means of balancing confidence and anxiety. The balance 
grows out of the need to answer a question and the fear of giving or accepting 
inaccurate answers. A well-designed research project will allow us to offer answers
in which we can feel confidence. While it is important to keep in mind that
different ways of presenting evidence and different types of evidence are used in
different fields, the first thing is to convince ourselves--to establish confidence in
answers for ourselves. After all, we posed the questions in the first place and so 
we are the first audience for our answers. 

Let's turn to the second response of Elisabeth Mann Borgese's world. If the response
is "I knew that already," the chances are that the world does not see the
question as interesting, deserving of further research. What is it that makes a 
question interesting, worthy of research? This, too, varies from field to field. In 
a sense, the definition of what is an interesting question in any field defines the 
field at that point in time. In fact, it is often said that a field can be defined as 
a discipline in its own right to the extent that it has its own separate research 
agenda. 

In disciplines that are engaged in the search for an overriding theory, questions 
are defined as worthy of research only if they contribute to theory formation. 
All other questions are held in abeyance (semi-interesting but not on track just 
now) or considered as irrelevant. For example, linguists wishing to construct a 
theory of competence have considered research on such performance factors as 
hesitation markers or such system components as turn-taking irrelevant and uninteresting.
There is nothing more discouraging than to learn that the question
most interesting to you is not interesting to the field. However, the direction in
which the field is moving often makes it impossible for researchers to find an
audience for their work unless it fits into the current mold, the current direction
of work in the field.

In other fields where a central unifying theory is not a primary concern, research 
may branch in so many different directions that it is difficult to see how individual
projects relate to one another. For example, studies on language mixing and
switching, work on foreigner talk discourse, descriptions of the acquisition of
temporality, studies of bilingual aphasia, and the English spelling systems of
Chicano fourth graders may never be presented in a way that reveals the theoretical
framework within which the studies were conducted.

You can imagine the problems that this creates across fields when researchers 
present their findings to one another. When theory formation is not discussed 
and, as is often the case, when outside researchers have limited access to the 
underlying assumptions of the field, it is almost impossible for them to understand
why these particular questions have been posed in the first place--the "who
cares" response. On the other hand, when theory formation is central, outsiders
may consider the theory irrelevant to their own research world and wonder why 
such issues are important. 

The "who cares" response (often voiced as "question is not interesting") is not 
uncommon in the research world. Unfortunately, such comments are too often 
taken at face value by researchers who, in turn, view these comments as coming 
from a group that is, at best, misguided and, at worst, arrogant and presumptuous.

We are fortunate in applied linguistics because the range of interesting questions 
is not limited to research where evidence could support or disconfirm some part 
of a central theory. Rather, our research questions are interesting to the extent 
that they can (a) apply to theory formation or theory testing, (b) apply to practice 
(e.g., curriculum design, materials development, development of language policy, 
test development), or (c) apply to both theory and practice. 

While different fields determine what constitutes interesting questions and appropriate
evidence, there are conventions in research that do cross fields of inquiry.
Research plans have been designed and methods of analysis devised that
give us confidence in findings. Such plans present "tried and true" ways of looking
for answers. The tried and true are not the only ways and, with guidance,
you will find that new methods can easily be designed. Your research approach 
can be as flexible as you wish. You may, in fact, want to take a multimethod 
approach in your search for answers. There is never a "one and only one" way 
to carry out a project. There is never a "one and only one" correct way to analyze 
the data. Some ways, however, are much better for some questions and other 
ways for other questions. By understanding how and why different researchers 
opt for different methods, you should be much better equipped to choose among 
methods yourself and to evaluate the research of others. 

In terms of confidence, we are very fortunate to have access to computers to help
us in our search for answers to questions. However accurate you may be in your
computations, computers can be even more accurate. No matter how rapidly you
can carry out computations, computers can do it faster. It is difficult, often, to
get a computer to do what you want, and once done, the computer never knows
if what it did is appropriate. Or, as is often said, "The data know not whence
they came." Computers are even easier to fool than readers. They will believe
you no matter what you tell them. They may issue you warning messages, but 
they will usually complain only if you use a language they cannot understand. 
They will never tell you whether you have compiled and analyzed your data in a 
sensible way. You (and your advisors) must determine that. The computer 
supplement that accompanies this manual will teach you how to carry out many 
of the analyses with a minimum of effort. 

To decide whether the methods and procedures you use are appropriate, you will 
need to understand basic research design and statistics. We believe that the best
way to do this is to work on your own research questions. However, for the
novice researcher, this is not always possible. For the novice, it may be as efficient
to begin by exploring the range of available possibilities.

Since this manual is written for the novice researcher, our approach will be to 
encourage you to form your own research questions while we present research 
questions and examples taken from the field of applied linguistics. We will look
at ways to state research questions, to search for answers, to compile and analyze
data, and to present findings. To do this, we have divided the book into three
major parts. Part I includes chapters on defining the research question and
planning a research project to search for answers. These plans form the research
proposal (which later becomes the basis of the final report). That is, Part I covers
activities that allow the researcher to plan a well-organized research project. Part 
II shows how the evidence, the findings, can be described using simple descriptive 
statistics and visual displays. Parts III and IV present a variety of statistical tests 
that tell us how much confidence we can place in statements made as answers to 
research questions. We hope to show the choices that are open to researchers in 
all of these steps. In working through the choices we offer, we believe you will 
develop a stronger notion of the principles that underlie research procedures. 
We also expect that you will become comfortable with and confident of the help 
that computers can give you. 

If you are not a novice researcher (and even if you are), you might want to turn 
to the pretest in appendix A now. If you know which procedures are appropriate 
for the data shown in the examples and if you feel comfortable using the terms 
listed in the pretest, then you already have the knowledge needed to meet the 
goals listed below. If you know some but not others, highlight the items you have 
yet to learn and focus on them during your study of this manual. If you are a 
novice, don't be discouraged! Just think how much you will accomplish by 
working through the manual. Our goals may seem overwhelming at the start, but 
you will be amazed at how much easier the pretest is once you have completed 
your work. 

Our major goals are:

1. To promote basic understanding of fundamental concepts of research design
   and statistics, which includes the following.

   a. Statistical literacy: the ability to read and evaluate research articles that
      include statistically analyzed data

   b. Statistical production: the ability to select and execute an appropriate
      procedure (with flexible options)

2. To make research planning (both design and analysis) as easy as possible
   through the following.

   a. Simple practice exercises to be completed by hand

   b. Practice with interpretation of results

   c. Review of research reports that illustrate research design and statistical
      analysis


The book has been written as a combination text and workbook. At the end of 
each chapter you will find an activities section that relates published studies to 
the material covered in the chapter. These activities are meant to give additional 
practice in evaluating research articles and to stimulate your own research interests.
In addition, you will find practice sections within each chapter. From past
experience we know that students can read very rapidly through a discussion of 
a statistical procedure, feel that they understand and yet a moment later realize 
that they have not been able to "hold onto" the concepts presented. By including 
workbook features in the manual, we hope to slow down the reading process and 
allow practice to reinforce the concepts. Filling in blanks and working through 
examples as you read can be frustrating, especially if you are not sure that your 
answers are correct. For this reason an answer key has been provided for 
questions that have been marked with an arrowhead (►) in the text. As you will 
see, other questions ask for your opinions, your research ideas, and your criticisms.
For these there are no correct answers. We hope these will lead to useful
discussion in your study group or classroom. 

Once upon a time there may have been a cat that was killed by curiosity. More 
often, it is curiosity that is killed. We hope that the examples we have included 
in this book will stimulate, rather than kill, your curiosity about language learning
and language use. We believe that the study of research design and statistical
analysis will give you exciting, new ways to satisfy that curiosity. 

Part I. Planning a Research Project



Chapter 1

Defining the Research Question 


• Sources of research questions
• Scope of research
• Feasibility of research
• Stating research questions and hypotheses
• Collecting research evidence
• Internal and external validity of studies

Research has been defined as a systematic approach to searching for answers to 
questions. This definition is straightforward but deceptively simple. To truly 
understand it, we must understand how systematicity relates to defining
questions, to defining search, and to defining answers. 


Sources of Research Questions 

In a recent study where teachers were encouraged to become teacher-researchers,
we asked teachers to talk about their research questions. As you might expect,
the teachers initially discounted the notion that they had interests which might
qualify for serious research. For them, research was something that people did
at universities or labs. They described research as having great theoretical scope,
with numbers and symbols that made it difficult to understand, and while important
and valuable, of little use in terms of impact on their teaching.

After that meeting, we wrote down our thoughts in our research journals. 

Do teachers have questions about teaching? Do they wonder aloud about 
teaching materials, about students, about techniques? Do they argue about 
curriculum design (e.g., whether a departmental arrangement where students 
have homeroom and then go to different teachers for math, English or ESL, 
science, history is "better" for fourth graders than a self-contained classroom)?
Do they share their questions with students? with other teachers?

Do teachers look for answers to their questions? If so, how do they go about 
doing this? If they find answers, do they share them with students and/or 
other teachers? 


What qualifies as an important question? If research findings aren't likely 
to have an impact on teaching, what makes them valuable and important? 

Do teachers truly believe research is only conducted elsewhere by other people?
If the research were not conducted at a university, would it be better?
more relevant? less difficult to understand?

How do teachers define "theory"? What kinds of questions have theoretical
importance? Is theory testing less relevant to classroom practice than research
that does not test theory?

How can we go about changing teachers' notions about the nature of research?

An endless list could be generated from this experience, but our basic question 
was whether our expectations for teacher-generated classroom research were viable.
Imagine you were a member of a university faculty and your president asked
all faculty members to state their research interests. Think about how you would
respond. We suspect that people everywhere react in similar ways to such top-down
instructions regarding research interests. When asked to come up with a
research question out of context and divorced from true curiosity, the results are 
probably predictable. We could compare such a situation to being asked to write 
a composition with an open-choice topic. Given lack of direction, it is difficult 
to think of anything and much time is spent complaining and/or worrying about 
how to please the person(s) who gave the directions. 

The first rule of research should be that the questions are our own, questions that 
we truly want to investigate. If we are going to invest the time, energy, and 
sometimes even funds in research, then the questions must be important to us. 
Of course, our questions are shaped by our experiences. Our teachers, colleagues, 
students, and the reading we do all guide us towards the important issues in our
field. Advice is often given that urges us to follow up on a "hot" topic. This is 
good advice not just because the topic is current but because others will want to 
know about the results. Other teachers may benefit from the research or other 
researchers will want to build on the research. If the topic is "current," it will also 
be easier to get the research report accepted at the conferences of our field 
(TESOL--Teaching English to Speakers of Other Languages; SLRF--Second
Language Research Forum; AILA--International Association of Applied Linguistics;
AAAL--American Association of Applied Linguistics; MLA--Modern
Language Association; AERA--American Educational Research Association;
NABE--National Association of Bilingual Education). It will be easier to publish,
and it may even help in getting a job. Nevertheless, unless the question is a
question we care about, it is unlikely that we will have the motivation to sustain 
us in the research process. 

Another source of research questions can be found in the journals of our field. 
If you scan these, you will again see how wide the range of "interesting" questions 
can be. The research does fall into major categories which, of course, change over 
time as new and old areas become the center of research efforts. The areas include
classroom research, skills based research, learner characteristics, teacher
characteristics, language analysis (for teacher reference or materials development),
language use outside the classroom, interlanguage analysis, language
policy, planning, testing and evaluation, and theory testing. Even if you only
check the table of contents of a few of the journals listed in appendix D, you
should soon understand how research in our field informs both theory and practice.

As you read the articles that interest you most, you may find the authors finish
their reports with a list of "unanswered questions"--ideas for further research.
Thus, journals show which questions are especially interesting at a particular
time, reveal the wide range of possibilities for research, and even offer us
specific ideas for further research.

Many times, student research topics are defined by supervising faculty (research
on noun complements, on reading comprehension, on reciprocity in
argumentative prose are parceled out to participating student members for work).
In such cases, personal pleasure and motivation can come from being a part of
the research team, but unless the question captures our imagination, the process
can seldom be as satisfying as we might like.

There is another reason why we pick particular research topics. While we often
research what we like or what appeals to us, we also often research what we think
we know already. We find ourselves pursuing a topic because, as we say, "Hey,
I have some knowledge of that now, I think I can contribute something to that
anyway."

There's a story about the writer William Faulkner that illustrates this point. It
seems that when Faulkner was young he left his native north-central Mississippi
and struck out for the booming town of New Orleans, Louisiana. There he hoped
to make a name for himself as a writer. He did reach New Orleans, and did work
about town, freelancing and writing whenever he could. He also produced his 
first two novels, considered by Faulkner scholars to be dismal failures. 

Faulkner sensed that his writing was not improving. One day, someplace in New 
Orleans, he ran across an acquaintance (also to become a well-known author), 
Sherwood Anderson. Faulkner expressed to Anderson his desire to become a 
great writer. Anderson replied something to the effect: "Then what are you doing 
here? Go home! Go back to Mississippi! Write about something that you know!" 

Faulkner took that advice and returned to Mississippi to begin his lifetime cycle 
of works about the area he knew best: Mississippi. His very next novel was The 
Sound and the Fury, considered by many to be a modern masterpiece and quite
possibly the single work in his life most responsible for his Nobel Prize. 

One of the best ways to begin to define what interests you and your thoughts 
about what you already know is with a research journal. Each time you think 
of a question for which there seems to be no ready answer, write the question 
down. Someone may write or talk about something that is fascinating, and you 
wonder if the same results would obtain with your students, or with bilingual 
children, or with a different genre of text. Write this in your journal. Perhaps 
you take notes as you read articles, observe classes, or listen to lectures. Place a
star or other symbol at places where you have questions. These ideas will then
be easy to find and transfer to the journal. Of course, not all of these ideas will
evolve into research topics. Like a writer's notebook, these bits and pieces of research
ideas will reformulate themselves almost like magic. Ways to redefine,
elaborate or reorganize the questions will occur as you reread the entries.

If you are not part of a research team, a second good way to begin is to form a
study group. Here you can find "critical friends"--people who will feel free to ask
all the critical questions they can and lend you the support that you may need.
"Critical friends" (or "friendly enemies") are invaluable to the researcher. They
help you reformulate your research questions, point out your errors and possible
threats to the validity and reliability of the research, suggest other ways of gathering
data, question your selection of statistical procedures, and argue over the
interpretation of your results. Advisors can and do serve this function, but critical
help from a group of friends is even better.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 1.1 

1. As a study group assignment, start a research journal. Enter your notes and 
questions in the journal for one week. In your study group, compare the research 
interests of the group members. 

Report on range of interests:_ 


2. Select two articles from recent journals related to the field of applied linguistics.
Attempt to fit each article into one of these broad categories.

Classroom research_ 

Skills based research_ 

Learner characteristics_ 

Teacher characteristics_ 

Language analysis_ 

Language use_ 

Interlanguage analysis_ 

Language policy/planning_

Testing and evaluation_ 

Program evaluation _ 

Theory testing_ 

Other_ 

3. What potential losses/gains do you see in advisor-generated research? Compare
your responses with those of members of your study group. Report the
variability you see in the responses and how you might account for different
reactions to this practice.

4. Your course instructor will decide how best to form study groups in your class.
However, think for a moment about what special characteristics you would look 
for in selecting your own "critical friends." List these below. 


(As a research project, you might want to collect the responses to the above 
question and, at the end of the term, ask the same question. Are you curious 
about whether and how the group process changes perceptions about which 
characteristics of "critical friends" truly are the most important?) 

Would you prefer to participate in a study group where interests are very similar 
or where there is a wide variety of interests? Why?_ 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Scope of Research 

Aside from personal interest, research questions need to have other characteristics.
They should be able to generate new information or confirm old information
in new ways. To be sure that this is the case, a review of the literature on
the research topic must be done. Imagine that you are interested in second language
proficiency but instead of looking at students' use of grammatical structures,
you want to investigate how well they can perform basic "speech acts."
Obviously this is not a topic that one would select without some acquaintance
with speech act theory. You also would not select the topic if you thought that
it had already been sufficiently addressed.

Still, the first thing to do is undertake a review of previous speech act research to 
learn exactly what has been done in this area with second language learners. If 
you went to your university library, you might be able to get a computer search 
for this topic. Such searches, like ordinary library searches, begin by looking at 
"key words" and "key authors." Key words and key authors for speech acts might 
include terms such as directive, assertive, commissive or such authors as Austin, 
Searle, Gumperz. Think for a moment about how broad a key word like "speech 
act" might be in such a search. While it is not as broad a key word as, say, linguistics,
the number of articles and books generated by a search with this key
word would be very large (and very broad). 
A search using bilingual as a key word would also generate a huge number of
items. Many of these would not be useful for, say, a language policy study of 
bilingualism in Peru. The question is how to narrow the scope of the search and 
at the same time find all the relevant entries. 
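
One way to picture this narrowing is to require that every retrieved record carry all of the key words we name, not just one of them. The sketch below is only an illustration: the four catalogue records, their key words, and the function name `search` are invented for the example and do not describe any real library system.

```python
def search(records, required_keywords):
    """Return the titles of records that carry every required key word."""
    required = {k.lower() for k in required_keywords}
    return [
        r["title"]
        for r in records
        if required <= {k.lower() for k in r["keywords"]}
    ]


# Hypothetical catalogue entries, invented for illustration only.
records = [
    {"title": "Study A", "keywords": ["bilingual", "language policy", "Peru"]},
    {"title": "Study B", "keywords": ["bilingual", "child language"]},
    {"title": "Study C", "keywords": ["speech act", "complaints", "ESL"]},
    {"title": "Study D", "keywords": ["bilingual", "language policy", "Canada"]},
]

# A single broad key word retrieves many items.
broad = search(records, ["bilingual"])

# Adding key words narrows the search to the relevant entries.
narrow = search(records, ["bilingual", "language policy", "Peru"])

print(len(broad), broad)
print(len(narrow), narrow)
```

The risk noted above still applies: a relevant study indexed under slightly different key words would be missed by the narrow search, so several combinations are usually worth trying.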

Hopefully, this question illuminates one of the first problems regarding the
definition of research questions: the questions are stated too broadly. To say we
want to know more about how well second language learners carry out speech
acts is a little like saying we want to know how well learners use language.
Precisely which speech acts do we want to investigate? What types of second
language learners (beginners, advanced) are we talking about? Are the learners
adult Korean immigrants in Los Angeles or Japanese high school students in 
Kyoto? In what kinds of situations should the speech events be investigated? Is 
the research meant to support test development? materials development? theory 
development? Where above we called for narrowing via key words, now we can 
narrow further via key sentences. 

Using these questions as a guide, we can redefine the research question by nar¬ 
rowing the scope. For example, the scope could be narrowed from: 

Investigate how well second language learners perform speech acts.

to:

Investigate Korean ESL students' ability to recognize complaint behavior
appropriate in an academic university setting.

Here "performance" has been narrowed from total performance to recognition 
(one portion of the total performance skill). "Second language learners" is nar¬ 
rowed to "Korean ESL students," and "speech acts" has been narrowed to one 
speech act subcategory "complaints." The events in which the subcategory might 
occur have been narrowed to those relevant to the university setting. There are, 
of course, many other ways in which the question might be narrowed. 

In narrowing the scope of the research, we may lose interest in the topic because 
it no longer addresses the larger question. An appropriate balance needs to be 
struck between scope and interest. It is possible to maintain the original research 
interest by carrying out a number of studies with limited scope. Together these 
studies would address the broader, general area of interest. 

A review of previous research will help us to define the scope of research in an¬ 
other way. We've already noted that the scope of the research must be realistic. 
But, even a fairly narrow question may need to be more carefully defined. Pre¬ 
vious researchers may already have done this. For example, many teachers are 
concerned with student motivation. Motivation, like bilingualism, is a very broad 
concept. Previous researchers have, however, grappled with this problem and 
have subcategorized the concept into types of motivation—for example, intrinsic 
and extrinsic motivation or instrumental and integrative motivation. In narrow¬ 
ing or subcategorizing the concept, operational definitions must be given to show 
the scope of the subcategory. 
Sometimes well-established operational definitions exist for terms that are crucial
to your research. Such a definition gives a "tried and true" meaning for the term
and an accepted method for observing or assessing it. However, sometimes there are no
such accepted definitions or no agreement as to what the terms mean. There are, 
for example, many abstract theoretical concepts that have been "constructed" in 
our field. These constructs are shown in abstract terms such as acquisition, mo¬ 
tivation, need achievement, monitoring, compound bilingualism. We may share a 
basic understanding of such theoretical concepts, but even these theoretical
definitions are difficult to formulate. For example, precisely how would you define
bilingual? A commonly shared definition of bilingual is "speaking two languages."
We all know that the term may be applied to people who are at all points of
fluency in the two languages (even to absolute beginners of a second language).
To use such a term in research would be almost meaningless. A more precise
definition is given in statements such as "Arabic-English bilinguals who scored
a 3+ or higher on the FSI inventory participated in this study" or "Children who
had participated in the Arabic immersion program in Cleveland schools in grades
K-3 constitute the bilingual group in this study."

When broad terms for constructs are used in research questions, we cannot rely
on a theoretical definition even if one is readily available. Terms must be
"operationally" defined. An operational definition is a clear statement of how you
judged or identified a term in your research. This is important for three reasons. 
First, you will need to be absolutely consistent throughout the research process 
in your definition. Second, it is important for consumers of your research so that 
they do not misinterpret your findings. Third, it is important to the research 
community that your study be replicable. Different results might be obtained by 
other researchers if they carry out a similar project and use a different definition 
of bilingual. 
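
The point about replication can be made concrete with a small sketch. Here two operational definitions of bilingual, modeled loosely on the two examples given earlier, are written as explicit rules and applied to the same people. Everything in it is an assumption made for illustration: the three participants are invented, the field names are hypothetical, and counting a "+" rating as half a level is simply one convenient way to compare ratings.

```python
def fsi_to_number(rating):
    """Convert an FSI-style rating such as '3+' to a number.

    Assumption for this sketch: a trailing '+' counts as half a level.
    """
    return int(rating.rstrip("+")) + (0.5 if rating.endswith("+") else 0.0)


def bilingual_by_fsi(person, cutoff="3+"):
    """Definition A: rated at or above the cutoff on the FSI inventory."""
    return fsi_to_number(person["fsi"]) >= fsi_to_number(cutoff)


def bilingual_by_program(person):
    """Definition B: took part in the immersion program in grades K-3."""
    return person["immersion_k3"]


# Hypothetical participants, invented for illustration only.
participants = [
    {"id": "P1", "fsi": "4", "immersion_k3": False},
    {"id": "P2", "fsi": "2+", "immersion_k3": True},
    {"id": "P3", "fsi": "3+", "immersion_k3": True},
]

group_a = [p["id"] for p in participants if bilingual_by_fsi(p)]
group_b = [p["id"] for p in participants if bilingual_by_program(p)]

print("Definition A:", group_a)
print("Definition B:", group_b)
```

The two rules select different "bilingual groups" from the same pool, which is why a reader who is not told which definition was used could neither interpret nor replicate the study.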

Good operational definitions can often be drawn from the existing literature. 
Sometimes, however, research is difficult to carry out because operational defi¬ 
nitions cannot be found that will satisfy the researcher. Sometimes no opera¬ 
tional definitions exist in the literature and the researcher must define terms. 
We know very little, for example, about how language is represented in the brain. 
Yet, many models of language acquisition talk about "acquisition devices,"
"filters," "parameters," "L1 → L2 transfer" as internal mechanisms. It is, of course,
possible to create an operational definition for these terms for an
individual project. A clear definition would be crucial to the research. (In some
cases, we develop an operational definition for such concepts but then find our¬ 
selves questioning the "reality" of the concepts themselves. The attempt to es¬ 
tablish concepts is an important area of research.) 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 1.2 

1. Write a research question that narrows the scope for the study of speech acts 
in a different way. _ 
What key words would you use to search for relevant studies for this research
question?_ 


Compare your question with those written by members of your study group. 
Which statements still appear to need further narrowing of scope? How could 
this be accomplished?__ 


Which of the questions generated in your study group are good candidates for 
contributing answers to the large general study of speech act performance?_ 


How much more research would need to be generated to begin to answer the 
broad general question? How large a research team do you think might be 
needed to work on an issue with this scope? 


2. In our example of speech act research, we talked about "performance," but 
we still must operationally define "perform" so that we will know precisely how 
it is measured. How might "performance" be defined and measured? Has the 
definition further narrowed the scope of the research? 


3. Imagine that your research question contained the key words acquisition and
L1 → L2 transfer. First write a definition giving your general understanding of
these concepts.

Acquisition _ 


Transfer 


Now write an operational definition for each term that shows precisely how you 
would measure or observe each. 

Acquisition in my study means _ 


and will be measured or observed by 


Transfer in my study means 
and will be measured or observed by


How close do you think your operational definitions are to the theoretical defi¬ 
nitions of these constructs? 


How likely is it that the theoretical definitions of these concepts actually reflect 
reality (i.e., how metaphorical are they)? __ 


List other abstract constructs for which either operational definitions or the con¬ 
structs themselves might be problematic. 


4. Select one research topic from your research journal. Identify key terms and 
give operational definitions for each. Ask members of your study group to cri¬ 
tique these operational definitions and make recommendations of ways to narrow 
scope. Which of these recommendations can you use to improve the study? 


ooooooooooooooooooooooooooooooooooooo 

Feasibility of Research 

So far, we have suggested that research questions should 
1. interest us

2. promise new information or confirm old information in new ways 

3. have reasonable scope 

4. have key terms that are clearly defined and operationalized 

Before we turn to stating the questions in a more formal way, we need to consider
whether or not the research is feasible. 

Many factors affect feasibility of research. To decide whether a research project 
is feasible or not means that you must know how much time the project will take 
and whether or not you have that amount of time to spend. When the topic is
very broad (as is that of language learners' performance of speech acts), it might
take a lifetime to investigate the topic. We have already talked about ways to
narrow the scope of the research to make it more feasible. One of the major
reasons we narrow scope is the amount of time we have available to carry out the
research. If your research is for a course and the course is only 10 or 18 weeks
in duration, the scope must be tightly constrained.

Assume that your sister and her husband have a first child. Your sister's hus¬ 
band speaks Spanish as a first language. They use Spanish at home rather than 
English. The other members of your extended family speak mainly English al¬ 
though, except for your mother who is monolingual English, they all use Spanish 
sometimes. You are very interested in investigating "bilingualism as a first
language" since the baby, Luisa, will develop the language(s) simultaneously. It
might take years, if not a lifetime, to complete such a project. 

Longitudinal studies, which follow an individual or group over a period of time,
can be very time-consuming. This is one of the reasons that many researchers
prefer a cross-sectional approach rather than a longitudinal study. In this
approach, data are gathered (usually only once) from different groups of learners
of different ages or different levels of proficiency.

If we assume that the data of the longitudinal study were described in terms of 
actual age (in the following chart, the number before the colon stands for years 
and the number following it represents months), the cross-sectional equivalent 
might look like this: 

Longitudinal Study of Luisa

0:9        1:0        1:3        1:6        1:9        2:0

Cross-Sectional Study of 30 Children

5 at 0:9   5 at 1:0   5 at 1:3   5 at 1:6   5 at 1:9   5 at 2:0

The assumption is that the data of the children at each of these age levels would 
be similar to that of Luisa at that age. The advantage is that all the data could 
be collected at one time rather than spread out over two years. This makes the 
study more feasible. The problem might be in finding children exactly these ages, 
all of whom were simultaneously acquiring English and Spanish. 
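
The age notation and the cross-sectional plan can be laid out in a few lines. This is a minimal sketch that assumes equal groups of five children at each of the six age levels, which is one way of arriving at the 30 children named in the chart; the function name `age_to_months` is our own.

```python
def age_to_months(age):
    """Convert 'years:months' notation (e.g. '1:6') to a total in months."""
    years, months = age.split(":")
    return int(years) * 12 + int(months)


# Age levels at which Luisa would be observed in the longitudinal study.
age_levels = ["0:9", "1:0", "1:3", "1:6", "1:9", "2:0"]

# Assumption for this sketch: equal groups of five children per age level.
children_per_level = 5

# Cross-sectional plan: a different group of children stands in for each age.
plan = {age: children_per_level for age in age_levels}
total_children = sum(plan.values())

ages_in_months = [age_to_months(a) for a in age_levels]

print(ages_in_months)
print(total_children)
```

Laying the plan out this way also makes the recruiting problem visible: each entry is a separate group of children who must be located at exactly that age.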

For a research question where the order of acquisition (i.e., when forms first
appear) is more important than accuracy, another possibility is to incorporate more
acquisition data from other children into the design. For example, you might try 
the following: 


Time Periods for Observational Data

0:9-1:0   1:1-1:3   1:4-1:6   1:7-1:9   1:10-2:0   2:1-2:3

Luisa     Luisa
          Juan      Juan
                    Maria     Maria
                              Ernesto   Ernesto
                                        Susan      Susan
This would allow you to collect observational data in six months rather than two
years. The one-period overlap could help substantiate the similarity of
developmental stages at these age levels. Again, it might be difficult to locate children
of appropriate ages for the study. In addition, your research advisor might warn 
you that age is not a very satisfactory way to equate stages for such studies and 
suggest the use of a type of utterance length measure instead. That would mean 
collecting data from a fairly large group of children to find those who fit stages 
defined by this utterance length measure. 

Such quasi-longitudinal plans for data collection have the advantage of cutting 
the time span of the research, but the researcher must be able to locate appro¬ 
priate learners and have time to collect the data from each of them. 

Time, of course, is not the only thing to think about in determining how feasible 
a study might be. Imagine that you want to look at some aspect of language use 
by bilingual children in elementary school classrooms. If you are not already lo¬ 
cated at a school with a bilingual student enrollment, you may have great diffi¬ 
culty in gaining access to the classroom. For access to be granted, many schools 
and teachers require a statement that the research will not disrupt regular
instruction. In some school districts (and for good reasons), there is a monumental
amount of red tape involved with school-based research; it simply may not be
feasible for you to gain access.

Feasibility may be determined not only by time and access, but also by the quantity
and quality of access. For example, assume the government was interested in finding
out the extent of bilingualism in your state. One way they could do this is by
including questions on the next United States census questionnaire. First, the
quality of the sample is likely to be biased given the existence of undocumented
aliens, who may or may not want to be part of the census count. Second,
quantity (the number of questions that could be allocated to this issue on the
questionnaire) would be severely constrained. Third, the cost, if bilingual census
takers must be found to conduct census taking, might not make the project
feasible. A phone survey (using every nth telephone number in a directory) is
another possibility, but we would need to know whether all people are equally likely
to have phones or to be accessible to the phone during calling periods.

The dollar cost of research may also determine the feasibility of the research. In 
planning a project, prepare a reasonable budget. Do you need tape recorders and 
tapes? Do you have the computer software you need for the study? If videotaped 
data are required for your study, are videocamera and tapes available? Can you 
operate the videocamera and observe a class at the same time, or must you hire 
a camera operator? Will the school and/or the learners expect some remuneration 
for participating in your study? Will you have to hire someone to help you code 
your data (to ensure that your codes are well described and can be followed by
anyone)? Do you need paper supplies, travel to and from a school, photocopies 
of 200 essays or computer entry of text data (perhaps via an optical character 
reader)? Try to make a complete list of everything you need. You don't want to
near completion of the study and find that you don't have the last $5.95 you need
for supplies. Think, then, about possible sources for funding for your project. 
Does your school have equipment you can borrow? Can you arrange a "help
exchange" with other researchers rather than paying outsiders to help with ratings,
coding, or data collection? Will the school district fund research such as yours? 
Could you apply for a grant through the National Science Foundation or the 
Office of Educational Research on Instruction? Are there nongovernmental 
groups that provide grants you could apply to? Does the school district or the 
university have an office that can assist you in preparing a grant proposal? If so, 
they may know good sources for funding. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 1.3 

1. Consider the study of Luisa. List other ways that the study might be made 
feasible. Discuss these in your study group and list any remaining feasibility 
problems for the study. 


2. If you hope to do research in a classroom, list the procedures needed to gain 
access to the classroom. Will restrictions be placed on the data-gathering proce¬ 
dure by the school? If so, will these restrictions make the study less feasible? 


3. In what kinds of studies might a telephone survey be less desirable than a 
personal interview survey? Why?_ 


4. As a study group assignment, select one of the limited-scope research topics 
from your research journal. Prepare a budget statement in the space below. In 
the statement include an estimate of time (and pay scale) for the principal re¬ 
searcher along with travel, supplies, and equipment estimates. (Every grant ap¬ 
plication has its own budget format, so the actual budget form will vary from that 
given below.) 
Research Budget


Total Cost_ 

List possible sources (and amounts) of funding or help for the research project: 


Assuming that you obtained the support you list above, would it be financially 
possible for you to carry out the research? If not, why? (For example, would you 
have to give up other jobs in order to carry out the research?) 


Discuss your budgets in your study group. List items which were overlooked in 
the various budgets. 

Group report:_ 


ooooooooooooooooooooooooooooooooooooo 

Stating Research Questions and Hypotheses 

Now that we have talked about the scope and feasibility of research questions, it 
is time to consider how these questions can be clearly stated. Imagine that you 
still wanted to describe the bilingual language development of the child, Luisa. 
Your basic research question was "Can I describe the bilingual language devel¬ 
opment of Luisa?" You realize that this is not feasible, so you have narrowed the 
question to "Can I describe the first 50 Spanish and first 50 English words ac¬ 
quired by Luisa?" 

Let's imagine what would happen if you, as a student researcher, brought this 
research question to your advisor. You soon might find yourself commiserating
with Elisabeth Mann Borgese's grandmother. Your advisor might ask why this
is an "interesting" question. Luisa's parents might find it fascinating (and, in
fact, they could easily be persuaded to collect the first 50 words for you). You
might find it fascinating too. Luisa's development, however, is of interest only 
insofar as it can be interpreted as an indication of bilingual development in gen¬ 
eral (or as an indication of how lexical development differs in simultaneous ac¬ 
quisition of languages vs. sequential acquisition, or how the process might differ 
from that shown by monolingual children). What do you expect that the 50 
words might show you about the process? Do you expect to see a change in 
meanings over time? Do you expect to see parallel words in the two languages? 
Do you expect that the types of words used will relate to interactions with specific
people? Do you expect to be able to identify Luisa's meanings for these words
accurately enough that you can check to see if more nouns are produced than any
other category? Do you expect to be able to see whether she uses her first words
to label objects and people or whether she uses them to request actions? What
do you expect to see in the data?

We often begin research with questions like "Can I describe X?" rather than
"Why do I want to describe X?" The answer to the "why" question shows us that
we expect the description to have some bearing on a question which is important
to the field. "Can I describe" questions are often difficult, for we seldom know
how we will go about this descriptive task until we begin to examine the data.
However, it helps to think about possible outcomes ahead of time so that we are 
ready to look for particular relationships as we begin the study. That is, explor¬ 
atory research is seldom completely open. 

To illustrate this further, consider the work that is done in "needs-assessment"
research. The basic question in such research is "Can I discover how X perceive
their needs regarding instruction (or whatever)?" Imagine that, like all teachers,
you realize that you cannot possibly instruct your students in language and also 
in all the content material that they need to learn. You want to involve parents 
in the instruction process. You wonder which instructional needs parents see as 
being important and which of these they would take responsibility for at home. 
The research is of obvious importance to curriculum planning. It has high
practical interest, but it is of interest to the field only if findings can be taken as an
indication of parental involvement in curriculum and instruction in a more
general sense.

Again, if you ask yourself what you expect to find out, all sorts of new questions
pop up. Do you expect all parents to respond in the same way? If not, what 
factors might help explain differences in their responses? Will you remember to 
tap these factors (e.g., will you collect information on education, age, work pat¬ 
terns, and so forth)? Do you imagine responses might be different if the students 
are already doing very well at school than if they are not? Do you have these 
data easily available so you can check this relationship? What about older
siblings? Should they be included as part of this study? Once we start thinking
about what might happen in the data, we begin to think about how to explain the 
findings. We develop hunches about outcomes, and rephrase our questions tak¬ 
ing them into account. 
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 1.4

1. Select one question from your research journal. Explain why it is an
"interesting" question for the field._


Frame the general research question for the study. 


Consider the possible outcomes of the study. What factors may enter into the 
research that might influence the outcomes?_______ 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

It is at this point in the planning process that we should take time to read, to 
observe, and to think about the project very carefully. We don't want to plunge 
into a project without taking time to let our minds sort through the many
possibilities. If you have kept a research journal, you will already have revised your
research question many times and thought about many alternatives and the
possible outcomes. It is important not to hurry the planning at this point. The
reason it is important not to hurry is that our thought processes are not just analytic,
but holistic as well. One thing that has come out of all the recent work on "expert
systems" in cognitive science is that the mental representation of problems
involves qualitative reasoning. "Expert" researchers don't approach problem
solving in a straightforward, analytic way. They give themselves time "to sleep on it,"
to think about the problem when they aren't thinking about it. If you read widely
and keep a research journal, you will soon find that you do this too. In the initial 
stages, novice researchers are pressed for time and this can lead to an end product 
in which the researcher finds little pleasure or pride. 

Usually when we have questions or when we wonder about something, we really 
do not know the answers for sure. That doesn't mean that we have no idea about 
what those answers might be or where to look for them. Our hunches about an¬ 
swers may come from reviewing the literature on the topic, from talking with 
colleagues, or from observing classrooms. These hunches about answers, when 
written in a formal way, are called hypotheses. We carry out the research to see 
if these hypotheses are supported or not.

One very popular notion is that research is a way to prove that an answer is right 
or wrong. Given the exploratory nature of much research in our field, we would 
like to disabuse you of this notion at once. It will seldom be possible to prove that
your answer is the right one for the question. Instead, you should consider the
research process as one in which you can collect evidence that will support or not 
support the relationship you want to establish or the hypotheses you have stated. 
Why this is so will be explained in much more depth in the following chapters. 

In formal terms, a hypothesis is a statement of a possible outcome of research. The
hypothesis may be stated as a null hypothesis and as an alternative hypothesis. 
Imagine you wished to discover if an order exists in the acquisition of English 
spelling patterns. This is all you wish to do. You do not want to see whether the 
order is different according to the age of learners or whether the learner's L1
might influence the order. In the null form, your hypothesis might be: 

There is no order of acquisition of English spelling patterns. 

While you might never write this null hypothesis out, it should be in the back of 
your mind as you collect and analyze your data since that is what the statistical 
test will test. The alternative hypothesis would be: 

There is an order of acquisition of English spelling patterns. 


You hope that your data will allow you to discount the null hypothesis and give 
evidence in support of the alternative hypothesis. 

(As you will see later, it is possible to test an alternative hypothesis when there
are strong theoretical reasons to do so or when previous research has already
allowed researchers to reject the null hypothesis. However, in our field, where
replication studies are few and far between, it is more customary to test the null 
hypothesis.) 

The null hypothesis is often annotated as H0. The alternative hypothesis is
annotated as H1.

Let's assume that spelling patterns have been scaled for difficulty; that is, there
is a known order of difficulty for major and minor spelling patterns in English. 
The literature gives an operational definition of the spelling patterns with exam¬ 
ples of each and the established order of difficulty. This order, however, was es¬ 
tablished using spelling tests of native speakers of English. The question is 
whether the order is the same for second language learners. If it is, then ESL 
beginners should be able to show accurate performance only on the easiest pat¬ 
terns. They would place at the bottom of the spelling order continuum. The more 
advanced the learner, the higher he or she should place on the continuum of the 
known order scale of difficulty for spelling patterns. 

Now assume that you have been hired to design a course book on English spelling
patterns for university students who are in intermediate and advanced ESL
classes. You would like to arrange the instruction to reflect the already
established spelling continuum. Before you begin, though, you wonder if spelling
errors of ESL students change as they become more proficient overall in the
language. This time, unlike the previous example, you want to look at a
relationship between two things: L2 proficiency and where students place on the
spelling continuum. You might state the hypothesis in the following ways:

H0  There is no relationship between L2 proficiency and placement on the
spelling continuum. 

This is the null hypothesis and it says that the evidence you gather will show no 
relation between a student's proficiency and placement on the continuum. If the 
null hypothesis is correct, then the continuum is useless as a guide to sequencing 
spelling patterns. If the null hypothesis is incorrect, then the continuum may be 
helpful. 

The alternative hypothesis would be: 

H1  There is a relationship between L2 proficiency and placement
on the spelling continuum. 

With both forms, you can test the null hypothesis, H0, against the alternative
hypothesis, H1.

In addition, you may state the alternative hypothesis in a directional form
(positive or negative). That is, on the basis of previous research in the field, you may
believe that a relationship does exist and that you can also specify the direction
of the relationship. If other researchers have found that the scale "works" for L2
learners (i.e., that the more proficient the learner is in general language
development, the higher the placement on the continuum), you can use a directional
hypothesis. If previous research suggests a positive direction, the directional
hypothesis is in the positive form.

H2  There is a positive relationship between L2 proficiency and placement
on the spelling continuum. 

This says that the more proficient the student, the higher the placement on the 
spelling continuum. If it is correct, then your data substantiate previous findings
and give additional evidence for the use of the sequence in materials development. 

H3  There is a negative relationship between L2 proficiency and placement
on the spelling continuum. 

This says that the more proficient the student, the lower the placement on the 
continuum. This seems an unlikely hypothesis and would not be used unless 
previous research had suggested that the continuum which was established for 
first language learners not only does not apply to second language learners (in
which case the null hypothesis would be correct) but that the continuum works
in the opposite direction (the negative direction alternative hypothesis is correct).
The spelling patterns in the continuum would then be reversed, so that what was
difficult for the L1 learner is easy for the L2 learner and vice versa!
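
To see what "positive" and "negative" mean in terms of data, it may help to look at a small sketch. It computes a rank-order (Spearman) correlation between proficiency scores and placement on the continuum and reports the direction of the relationship in the sample. The eight pairs of scores are invented for illustration, and the rank-order coefficient is our choice for the sketch, not a procedure prescribed here. Note also that the sign of a sample coefficient only describes these data; deciding whether the null hypothesis can be rejected requires a statistical test of the kind taken up in later chapters.

```python
def ranks(values):
    """Rank values from 1 (lowest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def direction(rho):
    """Describe the direction of the relationship found in the sample."""
    if rho > 0:
        return "positive"
    if rho < 0:
        return "negative"
    return "none"


# Hypothetical scores for eight learners, invented for illustration only.
proficiency = [35, 42, 50, 58, 63, 71, 80, 88]
placement = [2, 1, 3, 5, 4, 6, 8, 7]  # higher = further along the continuum

rho = spearman(proficiency, placement)
print(round(rho, 3), direction(rho))
```

For these invented scores the coefficient is about 0.93, the pattern the positive directional hypothesis describes. If the placements were reversed, the coefficient would be about -0.93, the pattern described by the negative directional hypothesis.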
ooooooooooooooooooooooooooooooooooooo


Practice 1.5

► 1. To practice stating and interpreting the meanings of these forms of
hypotheses, assume that you wanted to look at the relationship of language proficiency
and spelling test scores. How would you state and interpret the null and
alternative directional hypotheses?


Null hypothesis: 



Alternative hypothesis: 



Interpretation: 


Directional, positive hypothesis: 


Interpretation:_ 

Directional, negative hypothesis: 


Interpretation: 


ooooooooooooooooooooooooooooooooooooo 

In most research reports, the null hypothesis (even though it may not be formally 
stated) is tested rather than a directional alternative hypothesis. This is because 
there is seldom a body of research which has already established a relationship 
among the variables included in our research. Strange as it may seem, it is easier
to find evidence that supports a directional hypothesis than it is to reject a null
hypothesis. We will explain the reasons for this later. Nevertheless, there are
times when a directional hypothesis is appropriate (when previous research has
shown evidence in this direction). Different statistics will be used based on this
distinction of whether the hypothesis is directional. 

Sometimes it is necessary to write more than one hypothesis to cover the research 
question. For example, in the task of preparing materials for a spelling textbook, 
you might also want to know whether students from different first language 
groups (as well as of differing proficiency levels) get the same scores on general
tests of spelling. The null hypothesis for first language membership could be 
stated as: 

There is no relationship between first language membership and spelling test 
scores. 

This means you expect to find no difference among the groups. However, there 
is still a possibility that there could be large differences among students from 
different groups when students are beginning learners and that this difference 
might disappear over time so that there would be no difference among advanced 
learners. In such a case, our results would show an interaction where the effect 
of language proficiency interacts with the effect of the first language. This requires
a hypothesis about the possible interaction between first language membership
and proficiency with spelling scores.

There is no interaction between first language and proficiency and spelling 
test scores. 

If the results are such that an interaction is found, then you cannot say that either
first language or proficiency acts alone. Rather, they interact so that differences
in first language groups do show up at some levels of proficiency but not at others.
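
To make the idea of an interaction concrete, here is a minimal sketch using invented cell means. The first language groups, the scores, and the function name are all hypothetical; the point is only that the gap among first language groups is large at one proficiency level and nearly absent at the other. Describing means in this way does not test the interaction hypothesis; it only shows the pattern such a test would look for.

```python
# Hypothetical mean spelling scores (out of 60); not data from any study.
cell_means = {
    ("beginner", "Spanish"): 31.0,
    ("beginner", "Korean"): 40.0,
    ("beginner", "Arabic"): 24.0,
    ("advanced", "Spanish"): 52.0,
    ("advanced", "Korean"): 53.0,
    ("advanced", "Arabic"): 52.0,
}

def spread_by_level(means):
    """For each proficiency level, return the gap between the highest and
    lowest first language group means."""
    by_level = {}
    for (level, _l1), mean in means.items():
        by_level.setdefault(level, []).append(mean)
    return {level: max(vals) - min(vals) for level, vals in by_level.items()}

spread = spread_by_level(cell_means)
for level, gap in spread.items():
    print(f"{level}: gap between L1 groups = {gap:.1f}")
```

In these invented data the first language groups differ by 16 points among beginners and by only 1 point among advanced learners, which is the pattern described above.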


ooooooooooooooooooooooooooooooooooooo 

Practice 1.6 

► 1. Imagine that the researcher was also interested in the possibility that men 
and women might differ in spelling scores. State a separate null hypothesis for 
the effect of sex on spelling scores (i.e., ignore proficiency). Then state all the 
interaction hypotheses (i.e., adding in proficiency and L1 membership) in the null
form. 

H0 for sex:


H0 for sex and L1:


H0 for sex and language proficiency:


H0 for sex, L1, and language proficiency:


Three-way interactions are often difficult to interpret. If there were a three-way
interaction here, it might show that females from certain L1 groups at certain
levels of proficiency performed differently from everyone else. If the H0 for the
three-way interaction could not be rejected, how would you interpret the finding?


2. State the null hypothesis and alternative hypothesis for the research you defined
on page 22.

H0 ___




Check the hypotheses generated in your study group. List suggestions for further 
clarification of your study. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

As we think about our hypotheses, we need to consider what kinds of evidence 
we can use to reject the null hypothesis. Just as there are caveats about scope and 
feasibility of research questions, there are some things to think about in terms of 
data gathering. As with our statements about scope and feasibility, these comments
on data gathering are to help you to avoid (rather than to create) problems
so that the research is easier to do. They should help rather than hinder your 
pursuit of evidence for answers to questions. 


Collecting Research Evidence 

In planning a research project, it is important to consider what kind(s) of evidence
are needed so that your findings will allow you to support or reject your
tentative answers (hypotheses). The next step is to determine the best way to
collect the data. The data collection method is determined, in part, by the research
question. Given the wide range of possible research questions in our field,
it is impossible to review them here. However, we can discuss the issue of how 
to gather the evidence most efficiently. 

Unfortunately for our field, there is no central data bank to which we can turn 
for oral and written language data. However, almost every agency, university, 
or university professor does have data files and these files may already be entered 
into the computer. Carnegie-Mellon has a data bank for child language data. 
Your university may have the Brown corpus (1979) or the Lund corpus (1980) 
on line for your use. Individual faculty members may have precisely the data you
need to help answer your research question. In some cases, for example in
government-sponsored research institutes or centers, data banks may be open to
public use. In other cases, restrictions are put on the data.

Efficiency, of course, relates back to the amount of time you have available to 
carry out the project. If you have an unlimited amount of time, you might think 
of many different kinds of evidence and devise many different ways of gathering 
that evidence. The case presented would be exceedingly strong if you could use 
a multimethod approach. However, that luxury is not often available unless the 
project is undertaken as part of a research team effort. (This is one drawback 
of individual research. The balance between autonomy and efficiency in the research
process is one faced by every researcher.)

Consider, for the moment, that you wanted to look at the types of spelling errors 
written by men and women from different first language groups. To obtain data, 
you decided to contact English teachers in universities around the world and ask 
them to send you sample compositions of their advanced English students. There 
are a whole series of questions to be answered. How might you operationally
define "advanced" so that you are sure the samples will come from the appropriate
groups? How many compositions will you need to obtain to have an appropriate
sample? Will words representing all the spelling patterns actually occur in
all compositions? How difficult will it be to find and categorize each spelling error?
How long will it be before all the sample compositions actually arrive?

The strength of this method of data collection is that the errors will be those
committed by students during actual writing tasks. There are, however, many
weaknesses to this method that make it ill-advised. The data, if and when it arrived,
would likely not contain examples of all the error types of interest. The
examples might be extremely difficult to locate and categorize.

Assume you had 48 hours to gather the data! Via electronic mail, you contacted 
administrators at ESL departments at thirty American universities who immediately
agreed to administer a spelling test. The test requires students to select
the best spelling (from four choices) of a list of 60 words. The test is computerized
so that students enter their responses directly on the computer. There are
spaces where students identify their first language background and sex. The results
(in this most impossible of worlds) are returned via electronic mail the next
day.

The method would certainly be efficient but the data collection procedure has 
changed the analysis of actual errors to the analysis of the ability to recognize 
correct spelling patterns. Subjects might be either intrigued, anxious, or bored 
with the task depending on their acquaintance with computers and electronic 
mail. Depending on their reactions, the method might or might not be effective. 

Efficiency isn't everything. If the method you use is dull or frightening or boring
or takes too long, it's unlikely that your subjects (Ss) will be motivated to perform
as well as they might. One efficient method to trace semantic networks is by
galvanic skin response (GSR). Ss are conditioned with a very slight electric current
to a word, say chicken. A measurement is taken from (painless) electrodes
attached to the skin. Once the person is conditioned, other words are presented
and no current is used. Nevertheless, there will be a skin response if the new
words are linked to the word chicken. So, a reading would be obtained for duck
and, perhaps, farm or barn and so forth. Would you agree to participate in research
that used this method if we assure you that it is painless? A similar
method is to flash a light and create an eye-blink or pupil restriction as a response
to words. Again, once the conditioning is accomplished, the reaction will occur
to words by networks. If you wanted to know whether bilingual Ss showed the
same semantic network reactions in their two languages, you would more likely
select this second option. It's just as efficient and not so frightening.

Imagine that you wanted native speakers of French to judge the seriousness of 
12 types of gender errors in the speech of American students studying French. 
Each student in your class has presented a talk which you have recorded. Since 
there are 20 students in the class, this amounts to 10 hours of tape. You want 
the native speakers to listen to all 10 hours, scaling the seriousness of each error 
on a scale of 1 to 9. Would you agree to participate in this study as a judge (if 
you were a native speaker of French)? Probably not, since the time commitment 
would be great and the task fairly dull. The researcher would need to plan time 
blocks and ways of relieving boredom for the raters. 

The data collection method should not only motivate Ss to participate, but should
allow them to give their best possible performance. Methods that work well with 
adults may not work well with adolescents or young children. The context of 
data collection has to be one that allows everyone to give their best performance. 
If not, our findings may lead us to make erroneous claims. For example, 
Piagetians, using certain kinds of tasks, have shown that young children are not 
supposed to be able to take another person's visual point of view. Donaldson 
(1978) created a new data collection method using a model with toy children 
hiding from toy policemen. The model was arranged so that only by taking each 
policeman's visual point of view could the children decide where the toy children 
should hide. Four- and five-year-olds succeeded even when they had to coordinate
the points of view of two policemen whose views of the scene were different
from their own. In teaching, we look for the optimal way to make principles 
learnable. In research, we should also spend as much time as possible working 
out procedures that will allow us to get the best possible results. 

Assume that your school district will look into the value of multilingual/multiethnic
education. All kinds of performance measures have been collected, but
you also wish to assess the attitudes of relevant personnel to such programs. The
relevant personnel are children enrolled in the programs (kindergarten through
grade 4), teachers, administrators, parents, school board administrators, and
community leaders. Assume that you have worked out a needs-press analysis
questionnaire that works well for the adults. (In needs-press analysis the researcher
first asks respondents to list and rate the most important needs for a
program or whatever. Then, after the respondents have taken part in, or observed,
the program, the researcher asks how well each of the needs was met. The
comparison of perceived needs and need satisfaction makes up the study.) Obviously,
such a procedure will not reliably tap the attitudes of the children. One
possibility would be to use the happy-face method where children are asked how
they feel about X or how they feel when X happens. They are presented with an
array of happy to unhappy faces to point to. Whatever data-gathering technique
you use with young children, you will want to pilot the technique several times
until you are certain that you are getting reliable results.

We often use questionnaires as a way of gathering data. Time requirements and
task boredom can discourage people from responding. To ensure a better response
rate on return of questionnaires, it is important to consider exactly what
information must be obtained from the respondent and what information can as
easily be gathered from other sources. For example, there is no need to ask students
to list their GPA, SAT, or GRE scores if this type of information can be
more reliably obtained from other sources. The longer and more complicated the
questionnaire, the less chance of return.

Of course, the best suggestion we can give is that you consider the data collection
procedures used by researchers who have carried out studies similar to your own.
The method that you use to collect data will undoubtedly be influenced by that
used by previous researchers. If you wish to elicit monologue data, you might
follow Chafe's (1980) model and have people view The Pear Stories film and tell
the story back as you tape. Or you might use a silent film such as the montage
of extracts from Charlie Chaplin's Modern Times prepared by project researchers
working in Heidelberg on the European Science Foundation Project (a project
studying second language acquisition of adult immigrants in Europe). Of course
you would need to review each film story carefully in order to understand how it
might shape the choice of structures used in the narrative monologue. Or you
could copy Labov's (1972) "danger of death" technique to elicit narratives. For
investigations of grammar, you might want to use grammaticality judgments,
sentence repetition, or translation. For vocabulary, you could adopt a card sort
method to test the outer limits of core vocabulary items (e.g., pictures of various
boxes, bags, chests, baskets, etc., to see where Ss move from box to some other
lexical items or pictures of cars, vans, station wagons, jeeps, VW's, limousines,
etc., to see where Ss move from car to another lexical item). You might use a spew test
where Ss "spew" out as many examples of a core vocabulary item (or words that
rhyme or start with the same sound or whatever) as possible in a given time period.
To study communication components described as "foreigner talk," you
may tape teachers in beginner classes or set up dyads where one person has
information (e.g., how items are arranged in a doll house or how geometric shapes
are arranged on a grid) which must be conveyed to another person who cannot
see the arrangement. To investigate the reading or the composing process, you
may ask Ss to "think aloud" as they work. To check comprehensibility of input,
you may ask Ss to listen to one of their taped interactions, and tell you as accurately
as they can just what was happening (a "retrospection" method). To study
speech events, you may set up role-play situations where Ss return defective items
to a store (a complaint situation), issue invitations to parties (inviting and
accepting/rejecting invitations), give/accept advice, offer/receive compliments,
and so forth. The range of possibilities is never-ending. The important thing is
that the procedures work for you and give you data that can be used to answer
your research question. To find lists of tests which purport to measure particular
constructs, consult Buros (1975), Assessment Instruments in Bilingual Education
(1978), or other test guides in the reference section of your library. Previous research
that relates to your research questions will, however, give you the best
ideas as to the most appropriate methods to use to gather data. Once you have
completed the literature review for the study, you should be well versed in the
variety of techniques previously employed, be able to select from among them,
or be ready to offer even better alternatives.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 1.7 

1. What data sources are already available at your university or institution?


2. The two methods (GSR and eye-blink) of examining semantic associations we
mentioned are rather exotic. What other ways might the data be collected?
Compare these suggestions in your study group. What advantages or disadvantages
do you see for each?


3. List another way (other than "happy face") to gather data on the attitudes of 
young children toward an instructional technique. Compare the suggestions in 
your study group. What possible problems do you see with these techniques? 


Would the methods you have listed above work to elicit data on attitudes of adult 
nonliterate ESL learners? Why/why not? 


4. Most language departments and ESL programs have placement exams. What
personal information is requested at the beginning of the test? This is "questionnaire
data." Why is the information requested? What interesting research
questions can be answered on the basis of this information?


5. Select one research topic from your research journal. First, be sure that you
have limited the scope of the question, have stated hypotheses and written operational
definitions for key terms. Give two methods you might use to gather evidence
to support or reject your hypotheses.

a._ 


b. 


Review these procedures in your study group. What suggestions were made to 
improve your data gathering techniques? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Internal and External Validity of Studies 

Researchers talk about threats to both internal and external validity of their
studies. To distinguish between these two, Campbell and Stanley (1963) note
that internal validity has to do with interpreting findings of research within the
study itself. External validity, on the other hand, has to do with interpreting
findings and generalizing them beyond the study.


Internal Validity 

Some common threats to internal validity include subject selection, maturation, 
history, instrumentation, task directions, adequate data base, and test effect. 

We assume, when we collect data to answer questions, that we will be able to 
meet all threats to internal validity. Sometimes, that is not easy to do. Imagine 
that you wished to replicate the study conducted by Ben-Zeev (1976), which 
showed mixed results. In one part of the study she checked to see if bilingual and 
monolingual children had the same flexibility in recognizing that there can be 
other words for concrete objects such as book, table, cat. (Or, like Mark Twain's
Eve, do they believe that the symbol is the thing?) In one sample she had 98
Hebrew-English bilingual and English monolingual children. In a second sample,
she had 188 Spanish-English bilingual and English monolingual children. The
bilinguals outperformed the monolinguals in the first sample but there was no
difference between bilinguals and monolinguals in the second sample. You want
to see what would happen with a new sample.

To start the study, you collect data at a school, carefully checking that school
records show which children are bilingual and which are monolingual English
speakers. When you are on the playground, though, you notice that some of the
"monolingual" children actually speak some Spanish with each other. Your data
are compromised by poor subject selection, a major threat to internal validity.
Not all children in the monolingual sample were truly monolingual. Until each 
threat to internal validity is checked, you will not know how much confidence you 
can place in the results. Let's consider some of the most common threats to 
internal validity in more depth. 


Subject Selection 

It is important that you carefully identify the subject characteristics relevant to 
your study and that the subjects match that description. If your Ss are
monolingual and bilingual students, then the Ss selected must match your operational
definition of each category.

Selection bias can also occur if there are preexisting differences between groups
of Ss. You may wish to compare the effect of some particular teaching technique
for two groups of Ss, but it is impossible if the two groups are not really equivalent
at the start. To escape this threat to internal validity, all relevant subject
characteristics must be listed and checked for group bias in the selection process. 

In planning subject selection, you should also think about the potential for attrition.
This is especially important in longitudinal studies. Does the area or the
school district you work in have a stable student population? Another type of
attrition is especially important when you are comparing instructional programs.
Say that your university wished to increase the number of underrepresented ethnic
minority students admitted to the university. A sample admissions test was
given to 10th grade students intending to apply for college admission. Letters 
were sent to those students whose English language skill scores were low, inviting 
them to attend a special summer program that would advise them on ways to 
improve their chances of college admission. Fifty-five such letters were sent and 
23 students volunteered for the program. If all 55 of the students actually applied 
for college and you wanted to compare those who had and those who had not 
volunteered for this course, the comparison might not be "fair." There is always 
a chance that differential attrition might occur during this time period. Might the 
very best students from either of the groups decide not to apply after all? Or the 
very weakest students? Is it likely that more attrition might occur in one group 
or another? Differential attrition (also known as mortality) can have important 
consequences to the outcomes in the research. 
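
A simple way to keep an eye on differential attrition is to compute the proportion of Ss lost from each group and compare the two. The sketch below uses the group sizes from the example (23 volunteers, and therefore 32 non-volunteers out of 55), but the numbers of students who eventually applied are invented, as is the function name.

```python
def attrition_rate(started, completed):
    """Proportion of subjects who started but did not complete."""
    if started <= 0:
        raise ValueError("started must be positive")
    return (started - completed) / started

# From the example: 55 letters sent, 23 volunteers, so 32 non-volunteers.
volunteers_started = 23
nonvolunteers_started = 55 - 23

# Hypothetical counts of students who eventually applied to college.
volunteers_applied = 21
nonvolunteers_applied = 20

vol_rate = attrition_rate(volunteers_started, volunteers_applied)
non_rate = attrition_rate(nonvolunteers_started, nonvolunteers_applied)
print(f"volunteers: {vol_rate:.0%} lost; non-volunteers: {non_rate:.0%} lost")
```

If the rates differ as much as they do in these invented figures, a comparison of the two groups reflects who dropped out as well as any effect of the program.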

It is important, then, to take care in selecting people (or, in text analysis, pieces
of text; or, in one-person analysis, pieces of speech, i.e., utterances). This is often
called the subject selection factor. "Subject" is a conventional way of talking
about the people who participate in our research. Abbreviations you will see are
S for subject, Ss for subjects, and S's and Ss' for possessive forms.


Maturation


Another factor to consider is maturation. Maturation relates to time. In addition
to getting older, it includes getting tired, getting bored, and so forth. Sometimes
we hope to show a relation between special instruction and improvement in language
skills. It is possible that exactly the same results would have been obtained
simply because learners have matured.


History 

In addition to maturation, many other things could be happening concurrent with 
the research. Though you may be unaware of many of them, there is always the 
possibility they could affect results. These are called history factors. 

Imagine you teach Japanese at your university. The department wants to carry
out an evaluation of a new set of materials (and subsequent change required in
teaching methodology). Unknown to the department, you have instituted an informal
Friday afternoon gathering for your students at a Japanese sushi bar.
You have also arranged for "conversation partners" by pairing students with
Japanese families you know. The families invite the students to their homes for
informal conversation, and some of your students go on shopping and cultural
field trips with their "adopted families." If the department carries out the evaluation,
and you do not mention these concurrent activities, the results of the research
will not be valid.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 1.8 

1. In the study of "monolingual" children who also speak Spanish, would your 
study be compromised if it were only a matter of two of the children who knew 
Spanish? Would it matter if these two really knew only a few words of Spanish 
and that they used these only on the playground? How might you solve this 
problem?


2. Studies frequently mention subject attrition. For example, data were collected 
from 1,067 junior high Spanish-English bilingual children on a reading test, a 
spelling test, and a vocabulary test. Twelve students were absent for the reading 
test, 17 for the spelling test, and 91 for the vocabulary test. The report includes a 
comparison of the students' performance on the three tests. Compare this study 
with that of the 10th grade minority students mentioned in the discussion of 
attrition (page 34). In which case would attrition be a potentially more damaging 
factor? Why?


3. Imagine that in the evaluation of the Japanese teaching materials you began
the partners arrangement prior to any pretest evaluation. The gain scores from
pretest to posttest are to be used in the evaluation. Is it still important that you
inform the evaluators about the arrangement? Why (not)?


4. Select one research idea from your journal. What is the best way to select Ss
or texts for the study?


Is it possible that the findings for the study could be compromised by the subject
selection?


What history factors might influence the results? 


How might maturation influence the results? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Instrumentation 

It is important that the test instrument or observations used in research are both 
valid and consistent. The type of evidence that you collect to support or reject
your hypotheses will depend, in part, on the validity of your operational definitions
of the key terms in your research. Those definitions must be valid or it
will be difficult to persuade others of their appropriateness.

For example, if you operationally defined the abstract construct acquisition as
accurate performance (80% level or higher on a written test) of students on 10
grammar points, and then presented the findings to teachers of these students,
they might question your findings. First, they might argue that language acquisition
is something more than performance on 10 grammar points. Second, their
course syllabus may place heavy emphasis on oral language and, again, they
might question findings based on a written test of acquisition. Third, they may
remember the 20% error performance of their students and wonder if the criterion
level in your operational definition reflects acquisition. Your narrow operational
definition may not match their broader definition of acquisition.

Since there is always variability in performance (even for native speakers), you
might argue that 80% is a good level for defining what is or is not acquired. How
do you feel about students who reach the 79% level: do you believe they have
not acquired the structure? Do the people who group in 0%-79% behave in
similar ways as non-acquirers and the 80%-100% in another way as acquirers?
How much confidence would you have in the validity of any claims made about
acquirer/non-acquirer groups? These are questions the researcher must consider
and justify to give validity to operational definitions.
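
One practical check on a criterion level is to see how many Ss would change category if the cut-off were moved slightly. The sketch below is only an illustration: the scores, the alternative 75% criterion, and the function name are invented.

```python
def classify(scores, criterion):
    """Label each student 'acquirer' or 'non-acquirer' by a percent cut-off."""
    return {name: ("acquirer" if pct >= criterion else "non-acquirer")
            for name, pct in scores.items()}

# Hypothetical percent-correct scores on the 10 grammar points.
scores = {"A": 95, "B": 81, "C": 80, "D": 79, "E": 78, "F": 40}

at_80 = classify(scores, 80)
at_75 = classify(scores, 75)

# Students whose label depends on which of the two criteria is chosen.
changed = [name for name in scores if at_80[name] != at_75[name]]
print("labels that change between the two criteria:", changed)
```

Here students D and E are non-acquirers under the 80% definition and acquirers under the 75% definition, although their scores are barely distinguishable from C's. When many Ss cluster near the cut-off, claims about the two groups deserve less confidence.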

To be valid, a measure must truly represent what it purports to represent.
Teachers and learners often question the use of highly integrative tests, such as
the cloze test (where students supply every fourth or fifth word in a text) or an
error detection test (where students identify parts of sentences that contain errors),
as a measure of the construct proficiency. Imagine you are a language
teacher who believes in teaching discrete grammar points or that you work with
materials which are built around a grammar syllabus. If research defines proficiency
in terms of more global tests, it is unlikely that you would accept them as
adequate measures of proficiency. Conversely, if you are a language teacher who
believes in communication activities, it is unlikely that you would accept a
multiple-choice grammar test as an adequate measure of proficiency.

Since constructs are abstract (constructs such as proficiency, motivation,
self-esteem, or academic success), particular care is needed in evaluating operational
definitions. Often, we forget our objections to such definitions as we read the
discussion and application sections of reports. For example, if need achievement
is measured by persistence in a ring-toss game, and you think this is a poor
measure of need achievement, you would not accept the operational definition and
yet, in reading the results, you might accept the findings anyway because the
operational definition of "need achievement equals ring-toss" is forgotten. As
consumers of research (and as researchers), we need to keep the operational definition
in mind as we evaluate the results of research reports.

Suppose that you wished to test the claim that there is a language of academic
success which is abstract and "decontextualized," and that children who do not
have this "decontextualized" form of language in their L1 or L2 are doomed to
academic failure. While "decontextualized" language has never been defined,
there are a number of features which could be attributed to it. If you chose one
of these features (e.g., the use of indefinite articles for "new" information) and
examined the language of young children for this feature, you could only discuss
the findings in terms of that one feature, not in terms of contextualized or
decontextualized language. The discussion should, then, talk about this feature,
explain that it is one part of the puzzle and list, perhaps, other facets that might
be measured on the contextualized/decontextualized continuum.

Construct validity, then, has to do with whether or not the pieces of data collected
to represent a particular construct really succeed in capturing the construct.
Because so many constructs (e.g., acculturation, field dependence,
motivation) are extremely difficult to define, construct validation is a field of statistical
research in itself. The problem such research attempts to solve is whether
or not a set of items can be established that define a construct.

There are many factors other than construct validity that influence internal research
validity. In particular, we should examine the method we use to gather the
data. Our procedure for gathering the data can affect the validity of the research.


Task Directions 

If instructions are to be given to the people who participate in research, these instructions
must be carefully planned and piloted. This is especially true in the
case of directions given to language learners. Sometimes, after carrying out research,
we discuss findings only to discover Ss could have done the task accurately
if they had understood exactly what was expected. If the instructions are
not clear, the results are not valid.

Length of instruction is also important. A recent questionnaire distributed by a
university began with two single-spaced pages of instructions. The questions required
a code number response and the definitions of the codes were embedded
in these instructions (not near the questions). If the results of this report (purporting
to measure faculty interest in instruction) are to be valid we would need
to believe that faculty have both time and motivation to read such directions.
Further, we would need to believe that such directions can be easily kept in
memory or that Ss will flip back to consult the code numbers as they work
through the questionnaire. Neither assumption appears warranted.

These same instructions assured the respondents that the research was "anonymous,"
yet questions about years of service, sex, age, salary, department, and
college were included. If anonymity were important to respondents, we could 
rightfully question the validity of the data. 

Instructions, however, need not be aimed at the subject. There are also instructions
for the person(s) collecting the data. As you collect observations in the
classroom or administer a particular test, you may think of new ways to make the
procedure clearer to your Ss. While the changes may make the task clearer and
more pleasurable for all, the result is that the data may also change and no longer
be internally valid. When this happens, the best thing to do is to consider the
data collection to that point as a pilot study. You now have a better data collection
procedure and you can start again, maintaining consistency in the collection
procedure.

When you have a large project, it is often necessary to have assistants help in the
data collection process. You must periodically check and retrain assistants to
make sure the test instrument itself and the collection procedures remain
constant.


Adequate Data Base

Another problem regarding validity of measures has to do with the number of
times a particular item is observed or tested. For example, one research project
tested the relation between age and comprehension of the distinction between X
is easy to V and X is eager to V. The test for this distinction was as follows. The
child was presented with a blindfolded doll and asked, "Is the doll easy to see?"
A "yes" response was counted correct. If the child responded "no," the next
direction was "show me the doll is easy to see." It is unlikely that this single item
could provide a valid measure of the structure. Of course, other researchers noted
this and worked out alternative data collection methods where many examples
of the structure were tested.

If you have multiple items to measure one thing, those items should be carefully
arranged. Assume you decided that you needed 30 items to test the easy/eager
to V construction. As respondents react to the first two or three items, they set
up expectations as to how they are to perform on the rest of the items. This is
called forming a set response.

Consider some of the ways you might test the relation of age and pronunciation 
of a second language. You've been teaching Spanish to students in a home for
the elderly. To lessen the pressure of the research, you ask them just to push a 
red button when the words (minimal pairs) sound different and to do nothing if 
they are the same. However, with anxiety heightened, some of the students have 
trouble inhibiting the motor response once it is set up. One way to get around the 
set response is to give several filler items in between the test items. Since other 
responses are required, the response pattern can be broken. 
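
The idea of breaking up a set response with filler items can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name interleave_fillers and the choice of two fillers between test items are hypothetical, not part of any published procedure.

```python
import random


def interleave_fillers(test_items, filler_items, fillers_between=2, seed=None):
    """Return a presentation order with filler items placed between test items.

    Each entry is a (kind, item) pair, where kind is "test" or "filler".
    The number of fillers between test items is an assumption for illustration.
    """
    rng = random.Random(seed)
    fillers = list(filler_items)
    rng.shuffle(fillers)  # fillers are drawn in random order

    needed = fillers_between * (len(test_items) - 1)
    if len(fillers) < needed:
        raise ValueError("not enough filler items for the requested spacing")

    sequence = []
    for index, item in enumerate(test_items):
        sequence.append(("test", item))
        # No fillers are needed after the final test item.
        if index < len(test_items) - 1:
            for _ in range(fillers_between):
                sequence.append(("filler", fillers.pop()))
    return sequence
```

Because the fillers require a different kind of response, the S cannot simply repeat the response given to the previous test item.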

To be valid, the data-gathering procedure should allow us to tap the true abilities
of learners. If the tasks we give Ss are too difficult, people give up before they
really get started. Placing easy tasks first will give Ss the initial success they may
need to complete the procedure. On the other hand, if beginning items are very
simple, respondents may become bored. If they do not finish the task, the results
again will not be valid. In questionnaire research, we suggest that demographic
information (personal information regarding the respondents) be placed at the
end of the questionnaire. People are not usually bored giving information about
themselves. When these questions come first, Ss often answer them and then
quit. You're more likely to get full participation if other data are collected first
and personal information second. And, of course, Ss are more likely to return a
questionnaire that is finished than one that is not.

To be sure that the order of items does not influence results, you might want to 
order items randomly in several ways to obtain different forms of the same task. 
Alternatively, you might reverse the order to obtain two forms and then test to 
see if order was important. 
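
As a rough sketch of both options (several randomly ordered forms, or one form and its reverse), consider the following Python fragment. The function names and the default of three forms are our own assumptions for illustration.

```python
import random


def make_random_forms(items, n_forms=3, seed=None):
    """Build several forms of the same task, each with its own random item order."""
    rng = random.Random(seed)
    forms = []
    for _ in range(n_forms):
        form = list(items)   # copy so the original item list is untouched
        rng.shuffle(form)    # every form contains the same items, reordered
        forms.append(form)
    return forms


def make_reversed_forms(items):
    """Build two forms: the original order and its exact reverse."""
    forward = list(items)
    return forward, forward[::-1]
```

With either approach, you would record which form each S received so that you can later test whether order made a difference.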
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Test Effect 


Another factor that may influence the validity of research is test effect. If the
research begins with a pretest, that test can affect performance during the
treatment and on future tests. The test alerts students as to what teachers expect them
to learn. The pretest, then, can influence the final outcome. Note that the actual
form of items decays very rapidly in memory while the content of items remains.
(So, if your research is on grammatical form rather than content, the pretest may
not have a strong influence on a subsequent posttest.)

Test effect is especially important if tests are given within a few hours or even
days after the pretest. For this reason, many researchers try to create parallel test
forms so that the pretest and posttest are parallel but half the group receives form
A as the pretest and form B as the posttest and the other half receives form B as
the pretest and form A as the posttest. You must be VERY sure that the test
forms are equivalent (or know where the differences are between the forms)
before you try this! Otherwise, you may wipe out any gains that might have been
there.
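
The A/B counterbalancing just described can be sketched as follows. This is a minimal illustration, assuming that Ss are split into halves at random; the function name counterbalance and the dictionary layout are hypothetical.

```python
import random


def counterbalance(subjects, seed=None):
    """Assign half the Ss to form A then B, and the other half to B then A.

    The split is made at random (an assumption; the text only requires halves).
    With an odd number of Ss, the B-then-A group gets the extra S.
    """
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2

    schedule = {}
    for s in shuffled[:half]:
        schedule[s] = {"pretest": "A", "posttest": "B"}
    for s in shuffled[half:]:
        schedule[s] = {"pretest": "B", "posttest": "A"}
    return schedule
```

No S ever sees the same form twice, and each form serves as a pretest for one half and a posttest for the other.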


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 1.9 

1. Select one abstract concept (e.g., self-esteem, field independence, proficiency, 
communicative competence). Give an operational definition used for the concept 
in a published research report. Do you believe the definition reflects the total 
concept? What parts are or are not covered? Can you suggest a better measure? 


2. Designers of proficiency tests are concerned with giving test directions to Ss who
may not be proficient in the language. It is important that the directions not be
part of the test! Review the directions given on tests used in your program. How
might they be improved? If the test was validated on the basis of previous test
administrations, would a change in directions mean these figures need to be
reconsidered?


We expect that test items will show a range of difficulty. For the tests you
examined, is there an attempt to place easier items toward the front of the test? If
not, on what grounds would you recommend a change (or no change)?


3. Imagine you wanted to compare ESL Ss' ability to use old/new information
as a factor in deciding on relative acceptability of separated and nonseparated
forms of two-word verbs (e.g., call up your friend, call your friend up vs. call him
up, call up him) in written text. How many items with old/new information would
you suggest for such a study? How might you design the study to avoid a set
response?



4. Look at the ways you have decided to collect data for your own research.
Check to see if the measures seem to meet the requirements of adequate
operational definitions, good task directions, adequate amount of data, and task
arrangement. Discuss the problems you see with your study group partners and list
your suggestions for improvement.



5. In your study group, select one research question. Write out the task
directions and a few sample items. Select two appropriate Ss (with attributes that
match the intended Ss of the study) and pilot the directions. Compare your
perceptions as to the appropriateness of the task directions following this pilot.
List below the suggestions to improve directions.



OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


There are many other threats to internal validity that could be discussed. As you
will discover in carrying out your own projects, they can be never-ending. If you
have begun with strong operational definitions of terms that are involved in your
research, you will have formed a basis for dealing with many of these threats
before they occur. Consistency in data collection and valid subject selection come
with a carefully designed study.


External Validity 

Just as there are factors which influence internal validity of a study, there are
factors which influence external validity. First, if a study does not have internal
validity, then it cannot have external validity. We cannot generalize from the
data. For internal validity, we worry about how well the data answer the
research questions from a descriptive standpoint for this specified data set. When
we want to generalize from the data set, we are concerned not only with internal
validity but external as well: how representative the data are for the group(s) to
which we hope to generalize. We need to overcome the threats to external validity
so that we can generalize, can make inferential claims.


Sample Selection 

We often hope to be able to generalize the results of our studies to other
circumstances. To do this we must start with a detailed description of the population
to which we hope to generalize. This is the population from which we draw our
random sample. For example, we may test a group of university-level ESL
students and hope that we can interpret our results as applicable to other ESL
students. However, if the group has been selected by "convenience sampling" (they
happened to be the group that was available), no such generalization is possible.
Sample selection is central to our ability (or inability) to generalize the results to
the population we have specified. We cannot generalize anything from the results
unless we have appropriate subject selection. If we have specified the population
as ESL university students everywhere, then our sample Ss must be selected to
represent ESL university students everywhere.

One way to attempt to obtain a representative sample is via random selection.
In random selection every candidate (whether an S, a piece of text, or an object)
has an equal and independent chance of being chosen. This can be done by
assigning every candidate a number and then drawing numbers from a hat (or by
using a table of random numbers which you can find at the end of many statistics
books and in some computer software programs). The problems with this method
are many. For example, it may be the case that there are more immigrant
students in ESL classes than there are foreign students. There may be more men
than women in such classes. There may be more Spanish than Farsi speakers in
such classes. There may be more students from the sciences than from the
humanities. These differences are important if the sample is to be truly
representative of the group to which you hope to generalize results. The way to solve this,
once you have determined which factors are important, is to create a stratified
random sample. You decide ahead of time what portion of the sample should be
male/female, immigrant/foreign, and so forth. All possible candidates are tagged
for these characteristics and are then randomly selected by category.
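
A stratified random selection of this kind might be sketched in Python as follows. The function name stratified_sample, the stratum_of tagging function, and the quotas dictionary are our own assumptions for illustration; the quotas stand for the portions you have decided on ahead of time.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, stratum_of, quotas, seed=None):
    """Randomly select a fixed number of candidates from each category.

    candidates : all possible candidates
    stratum_of : function that tags a candidate with its category
    quotas     : dict mapping each category to the number to select
    """
    rng = random.Random(seed)

    # Tag every candidate and group candidates by category.
    by_stratum = defaultdict(list)
    for candidate in candidates:
        by_stratum[stratum_of(candidate)].append(candidate)

    sample = []
    for stratum, quota in quotas.items():
        pool = by_stratum.get(stratum, [])
        if len(pool) < quota:
            raise ValueError("too few candidates in category %r" % (stratum,))
        # Within a category, every candidate has an equal chance of selection.
        sample.extend(rng.sample(pool, quota))
    return sample
```

To stratify on more than one characteristic at once, stratum_of can return a combined tag such as ("female", "immigrant").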

Imagine that you wanted to look at the use of agentless vs. regular passives in
written text. But what is a representative sample of written text? There are many
types of genres and within each genre (e.g., narrative) there are many possible
sources of written text (e.g., news accounts of happenings, science fiction stories,
folktales, and so forth). One possibility for obtaining a representative sample of
written text would be to use the Brown corpus and take a stratified random
selection of words (the number to be based on the proportion of the total word
count in each category) from randomly selected samples of each of the categories
identified in the corpus (e.g., novels, social science, science, and so forth).
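
The proportional part of this procedure is simple arithmetic: each category receives a share of the sample equal to its share of the total word count. A minimal sketch follows. The function name is hypothetical, and handing leftover words to the categories with the largest fractional shares is our own choice for dealing with rounding.

```python
def proportional_allocation(category_sizes, total_sample):
    """Split total_sample across categories in proportion to their sizes.

    category_sizes : dict mapping category name to its total word count
    total_sample   : number of words wanted in the whole sample
    """
    total = sum(category_sizes.values())
    exact = {name: total_sample * size / total
             for name, size in category_sizes.items()}

    # Start from the whole-number part of each share.
    allocation = {name: int(share) for name, share in exact.items()}

    # Hand out any words lost to rounding, largest fractional part first.
    leftover = total_sample - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation
```

Any word counts you pass in should come from the corpus documentation itself; the sketch does not supply them.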

As you might imagine, subject selection with random stratified selection
procedures can be an art that is especially crucial to large survey research projects
(whether the survey is of people or of texts). If you would like to see how
extensive this procedure is, you might want to consult your sociology department.
Political science, psychology, health, and social welfare departments also
commonly engage in large-scale survey research. Your research group might like to
invite faculty engaged in such research to speak about methods they use in
selecting subjects.

Consider the following situations. (1) You have a list of elementary schools in
school districts in your state. You need thirty such schools to participate in your
research. You begin to select schools randomly, contacting them to secure
permission for the research. When you have 30, you stop. Have you obtained a
random sample of schools? No, you have not. You have begun to develop a pool
of volunteer schools from which the final random selection can be made. (2) You
want to use community adult school students as the data source for your
research. You want to avoid using established classes because you cannot get
random selection. So, you advertise in the school paper for volunteer Ss. The early
volunteers (the first 30 who call) become your Ss. Do you think you have selected
a random sample? Again, you have not. You have begun a pool of volunteer
Ss from which a random selection might be drawn. Notice, however, that in both
these examples, the sample is made up of volunteers. Volunteers may not (or
may) be representative of the total population.

Hopefully, these paragraphs point out the difference between a random sample,
random selection, and a representative sample. You can achieve a random
sample if everyone or everything has an equal and independent chance of being
selected. When, as in the case of people, Ss must agree to be part of the sample,
it is important first to get a large random sample of people who do agree and then
make a random selection from that group for the final sample. It is also
important to determine whether people who agree to participate are representative of
the population to which you hope to generalize results.
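
The two-step procedure (a large random sample of people who agree, then a random selection from that pool) could be sketched like this. The names two_stage_sample, agrees, invite_n, and final_n are hypothetical, and in practice agrees stands for actually contacting each candidate rather than calling a function.

```python
import random


def two_stage_sample(population, agrees, invite_n, final_n, seed=None):
    """Draw a final random sample from a randomly invited pool of volunteers.

    population : everyone (or everything) eligible for the study
    agrees     : function returning True if a candidate consents
    invite_n   : number of candidates to invite at random
    final_n    : size of the final sample
    """
    rng = random.Random(seed)

    # Stage 1: invite a large random sample and keep those who agree.
    invited = rng.sample(list(population), invite_n)
    pool = [person for person in invited if agrees(person)]
    if len(pool) < final_n:
        raise ValueError("volunteer pool is smaller than the final sample")

    # Stage 2: random selection from the pool, not the first to respond.
    return rng.sample(pool, final_n)
```

Note that the sketch does nothing to check whether those who agree resemble those who decline; that question still has to be examined separately.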

Let's briefly compare the importance of subject selection for both internal and
external validity. Imagine that you collected data from ESL students in your
program. The Ss are from three different proficiency levels (beginning,
intermediate, and advanced). If you want to compare the data to see if there are
differences among these three groups in your program, it is important that all the Ss
represent the level to which they are assigned. If the Ss in the study are not
representative of these three levels, then claims about differences in levels cannot
be made with confidence. That is, there is a threat to internal validity. However,
if this threat to internal validity has been met, the results are valid for these
particular three groups of Ss. Statistical procedures, in this case descriptive statistical
procedures, can be used to test differences in levels.

Then, imagine that you want to say something about ESL (beginning,
intermediate, and advanced) students in general. Not only do you want to compare the
data of your Ss across these three levels, but you want to generalize from these
Ss to others. However, it is unlikely that you will be able to generalize from three
intact classes to all beginning, intermediate, and advanced students. The students
and the levels in your particular program may not be representative of such
students in programs elsewhere. There are a number of ways to address this
problem of external validity. One is to use random selection of Ss within the study.

There are times, however, when you doubt that straight random selection of Ss
will give you a representative sample, that is, one representative of the group to
which you hope to generalize. Perhaps the subject pool at your university consists primarily
of Chinese Ss from the PRC (People's Republic of China) while you know that
nationwide, the proportion of Chinese Ss would be much less. The purpose of
selecting a stratified random sample is to assure that all Ss in the population are
proportionally represented in the sample. Though researchers sometimes do
make comparisons of proficiency levels, L1 membership, sex, or other factors
taken into account to obtain the stratified sample, the intent of stratified random
sampling is to obtain a sample that represents the population to which we hope
to generalize. If threats to external validity are met, this will be possible. To
accomplish this generalization, the researcher will use inferential statistical
procedures.

Descriptive and inferential statistical procedures are presented in this manual.
It is not important at this point that you know which are which but rather that
you understand that statistics can be used to describe a body of data. It is
important that the data have met all the threats to internal validity so that the
description be accurate. In addition, statistical procedures can be used to infer or
generalize from the data. To do this, a detailed description of the population to
which you will generalize must be given, Ss must be randomly selected using that
description, and threats to internal and external validity must be met. Otherwise,
no generalizations can be made.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 1.10 

1. Assume that you wanted to generalize the results of a study to ESL university 
students everywhere. List the characteristics the sample would need to include 
to be representative of this group. Compare your list with those of your study 
group partners. Determine which characteristics would be crucial and which not. 


2. Review the example of a text search for agentless vs. regular passive. Imagine
that you decided to limit the search to advice-giving contexts (e.g., recipes, Dear
Abby, "how-to" books on gardening, enrollment instructions in college catalogues)
because you believe that procedural discourse will contain large numbers of
passive constructions. How would you select a representative sample for the study?
Compare the tags suggested by all the members of your study group and prepare
instructions for the selection of samples.


3. Assume you need 30 Ss and there are two classes which have more than 30
students and these classes are taught by friends of yours so there would be no
problem of gaining access to the classes. You worry about such convenience
sampling, though, and decide to advertise in the paper. You get exactly 30
responses to the ad from people willing to participate for a small fee. As a study
group assignment, decide which method of obtaining Ss would be best. How
might your selection affect your research outcome? Will you be able to generalize
from the sample selected? Why (not)?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Although this chapter is titled "Defining the Research Question," we have given
a great deal of space to the discussion of research validity. Imagine that you
defined a research question and then discovered that your study does have a threat
to validity. Of course, that is undesirable. But exactly what happens? What is
it that is so awful? Consider the results of your threatened study: the data are
untrustworthy. They contain "error." You cannot generalize from the results
because the threats to validity caused the data to be, somehow and to some
degree, wrong. You may have a maturation effect, and the posttest score you
thought was due to learning was in fact due to simple maturation of the subjects
(which is "error" as far as you are concerned; you are not studying maturation).
There may be other factors that you did not consider, and thus your posttest is
affected by more than you intended (again, "error" as far as you are concerned).
You may discover that your Ss were unique and that you cannot reliably
generalize to other situations, and so your posttest is not accurate (i.e., it contains
"error").

Now imagine that your study is excellent; you have controlled for all possible
threats to its design validity. However, imagine that you suffer from threats to
measurement validity. For example, you may have inadvertently used a grammar
posttest when you were really intending to measure reading; this is a
measurement validity problem (and, you guessed it, a source of "error"). Or, you may
be giving a posttest on the effects of instruction but you neglected to check the
test carefully against the syllabus of the instruction; this is a measurement
validity problem also (and it, too, adds "error" to that posttest). You could have a
posttest that is simply bad: the developers of the test paid little attention to test
construction; this is a threat to measurement validity (and, yep, more "error").

In fact, any realized threat to validity creates some "error" in the data description
and analysis of the data. A good way to visualize this relationship is to consider
the following two diagrams:

[Diagram: LITTLE threat to validity in the study]

The concept is that any measurement you take ("total score") is composed of what 
you want to measure ("true score") and what you do not want to measure ("error 
score"). And, any threat to validity, be it to the design of the study or to the 
measurement instrument itself, increases error score. 
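
A small simulation can make the relationship concrete. The sketch below goes beyond the text in one respect, which we flag as an assumption: it treats error as random noise with a mean of zero added to the true score, and treats a larger threat to validity as a larger spread in that noise. The function names are hypothetical.

```python
import random


def observed_scores(true_score, error_sd, n=1000, seed=0):
    """Simulate n total scores, each equal to true score plus an error score.

    Assumption: error is normally distributed noise with mean 0 and
    standard deviation error_sd.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n)]


def mean_abs_error(scores, true_score):
    """Average distance between the total scores and the true score."""
    return sum(abs(score - true_score) for score in scores) / len(scores)


# Little threat to validity: total scores stay close to the true score.
little = observed_scores(true_score=70, error_sd=2)
# Extensive threat to validity: total scores stray much further from it.
extensive = observed_scores(true_score=70, error_sd=10)
```

Comparing mean_abs_error(little, 70) with mean_abs_error(extensive, 70) shows how much less the total score tells us about the true score when error is large.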

Finally, have you ever wondered why test preparation courses are so popular? 
For example, in the United States, there are companies that do nothing but train 
people how to take law school exams, medical board tests, and so on. Those 
specialty schools do not teach the content of the exams; they do not teach the true 
score. Rather, they teach test-taking strategies and train for extensive familiarity 
with these major tests. In so doing, they reduce the examinees' error score. Now, 
if the examinees know the material already and if their true score would be high 
in the presence of little error, then you can see how these test-taking service 
companies make money: they help people reduce error score, display more of their 
true score (which is high since they know the material) and achieve a higher 
overall total score. What an elegant business concept! 

Of course, if you go to a medical certificate training class and you do not already 
have medical training, no amount of "error score reduction training" is going to 
help you. And that is one of the first things such trainers will say, in the very first 
hour of the very first training class.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 1.11

1. Think about one of your research questions. List two possible threats to the
validity of the research project. Draw a "true/error" diagram to show how much
you might improve your project by working out ways to eliminate these threats
to validity or reliability.


2. If you have taught a course for students planning to take the TOEFL exam,
how much time was spent on content? How much on types of exam questions?
How much on general test-taking tips? For whom was the course most or least
effective? Why?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO


In this chapter, we have begun to think about the research process as a systematic
search for answers to questions. We have begun to offer you some of the ways
researchers have solved problems that confront every researcher: how to define
the scope of research, ways to make research more efficient or feasible, and
considerations of internal and external validity that make or break our research
efforts. In the next chapter, we will consider a systematic approach to identifying
and describing variables.


Activities 

1. Below are brief summaries of five studies. On the basis of the summaries, 
write null hypotheses for each. 

D. L. Shaul, R. Albert, C. Colston, & R. Satory (1987. The Hopi Coyote story
as narrative: the problem of evaluation. Journal of Pragmatics, 11, 1, 3-25.)
tested the notion that, since stories have the same job to do the world over,
narratives may be structured in the same way across languages. The authors show
that the Hopi Coyote stories do not contain an evaluation section (a section which
specifically points out why the story is told and thus validates the telling). The
moral of the story is also usually unstated. The authors give examples and
postulate reasons for this difference in narrative structure.

J. Becker & P. Smenner (1986. The spontaneous use of thank you by preschoolers
as a function of sex, socioeconomic status, and listener status. Language in
Society, 15, 537-546.) observed 250 children, aged 3:6 to 4:6, as they played a
game with their teachers and then received a reward from either an unfamiliar
peer or adult. Girls offered thanks more often than boys, and children from
lower-income families thanked more often than those from middle-income
families. Adults were thanked more frequently than peers.

H. Warren (1986. Slips of the tongue in very young children. Journal of
Psycholinguistic Research, 15, 4, 309-344.) Slips of the tongue (e.g., "with this
wing I thee red") have long intrigued researchers. This study analyzed data of 
children in the 23 to 42 month age range. The number of slips made by the 
children was compared with the frequency of slips in the speech of adults 
(mother, teacher) with whom they interacted. Children made very few slips of the 
tongue. 

M. Crawford & R. Chaffin (1987. Effects of gender and topic on speech style.
Journal of Psycholinguistic Research, 16, 1, 83-89.) Men and women were asked
to describe pictures that had previously been shown to interest men or women,
or to be of neutral interest to both. The picture descriptions were coded for
recognized features of "women's language" (WL). There was no difference between
male and female Ss' use of these features in the descriptions. While male-interest
pictures elicited more language than the other pictures, there were no differences
otherwise. The authors speculate that the features of WL may actually be related
to communication anxiety for both genders.

A. Pauwels (1986. Diglossia, immigrant dialects, and language maintenance in
Australia: the case of Limburg and Swabian. Journal of Multilingual and
Multicultural Development, 7, 1, 513-530.) studied language maintenance of German
and Dutch immigrants to Australia. Some Ss spoke standard Dutch and German
and others also spoke dialects (Swabian for German and Limburg for Dutch).
There was no substantial difference in language maintenance of these groups
although Limburg dialect speakers maintained only their dialect and not the
standard variety. Second, language use patterns were examined to see how migration
changed use of the first language.

2. Select three of these studies. From the descriptions given here, identify the
key terms which would require an operational definition to allow others to
replicate the study. Write an operational definition for each key term.

3. Having completed 1 and 2 above, read one of the three articles you selected.
Compare your hypotheses and operational definitions with those given in the
article. Do you believe the author(s) did a good job in stating the research
questions and hypotheses and in offering operational definitions? Do you believe
you have written clearer (or more adequate) hypotheses and definitions?

4. Authors usually present the broad research question in the introduction
section of reports. The actual study may be much narrower in scope than the
original general question. Select one of the above articles (or, even better, one
related to your own research interests) and list, first, the broad research topic
and, second, the actual question asked (or reported on) in the article. How
completely does the actual report answer the broad research question? Does the
author suggest other research that could be carried out to answer the broad
research question more fully? If so, what suggestions were made? If not, what
suggestions do you have?

5. In the study you selected to answer item 4, what threats do you see to internal
and external validity? Was subject selection an important issue in the research?
Could maturation have accounted for any of the findings? What history factors
do you think might have been involved? Does the author generalize from the
study? If so, how were the Ss selected? Are they truly representative of the group
to which the author generalizes? If the setting for the study was tightly controlled
(as in a laboratory), do you feel comfortable about generalizing to situations
outside this setting?

6. For those studies you actually read, note the general layout of the reports.
How similar/different are they? What evidence is presented in each case? What
types of visuals (charts, graphs, pictures) are used to present the evidence? How
effective do you think the report is in stating the question, in explaining how
systematically the search for answers was carried out, and in presenting the
evidence for answers?


Chapter 2

Describing Variables 


• Research variables 

Variable vs. level 

• Measurement of variables 

Nominal scale variables 
Interval scale variables 
Frequency data vs. score data 

• Function of variables 

Dependent and independent variables 
Moderator and control variables 
Other intervening independent variables 


Research Variables 

In chapter 1, we mentioned that we can expect variability in anything we observe.
An ESL student's language skill may vary from week to week. You may be able
to account for this variation in individual performance by considering amount of
instruction. Skill does not remain constant. The ability of a group of American
students learning Cantonese to recognize and reproduce the tone system may
vary. You may be able to account for this variability by determining whether the
students have learned other tone languages, whether they are young or old, male
or female, or, perhaps, suffering from hearing loss. Different pieces of text may
vary in frequency of "hedges." Academic science texts may include many more
lexical hedges ("it appears," "it seems") to certainty of claims than other types of
text materials. Variability and explanations of that variability are central to
research.

A variable can be defined as an attribute of a person, a piece of text, or an object 
which "varies" from person to person, text to text, object to object, or from time 
to time. 

One characteristic on which human performance varies is the ability to speak a
variety of languages. Some people are monolingual, others are bilingual, and
others multilingual. If we wanted to investigate multilingualism, we could code
the number of languages spoken by each S. The number would vary from person
to person. Notice here that the variable multilingualism is defined in terms of the
number of languages spoken. Performance behavior in each of these languages
also varies. If our research question asked about performance levels, then, with
an operational definition of performance, we could investigate a variable, L2
performance, because this is a variable on which people vary.

Variables can be very broad or very narrow. For example, the discourse,
semantic, syntactic, phonological elements of language are attributes of language.
They are also something attributed to people in varying degrees of proficiency.
A variable such as phonological system is broad, indeed, when assigned to Ss.
The variable rising tone is less so. The broader the variable, the more difficult it
may be to define, locate, and measure accurately.

Variables can be assigned to groups of people as well as individuals. For
example, the variable bilingual can be assigned to a society as well as to an individual.
The variable is defined by its place in the research project.

In a study of written text, text type might be a variable. Pieces of text vary not
only by type but in many other special characteristics. For example, different
texts may vary in length, so length could also be a variable that one might
examine in text analysis. Which characteristics are relevant will depend on the
research questions.

Objects may be important in applied linguistic research and, of course, they too
can vary in such attributes as color, size, shape, and weight. For example, in a 
test of reading, letter size or font style might be varied so that size or font would 
become variables. Color might be used in illustrations which are used to teach 
vocabulary. Color (vs. black and white) would be a variable on which the objects 
(illustrations) varied. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 2.1 

1. Imagine you wished to relate language proficiency and personal traits of 
learners. List two personal traits that vary across learners that might be included 
in the study. 


How wide a range of variability is there within each of these variables? 


2. Assume you wished to relate language proficiency to traits of particular societies.
List two variables attributed to societies that could be used in the research.


How wide a range of variability is there for these two variables? 


3. Imagine you wished to relate the use of personal pronouns (e.g., we, you, I,
they) to text characteristics. List two variables of texts which might be investigated
in the research.


How wide a range is possible for each of these variables?


4. Suppose that you were interested in descriptive compositions. You asked
students to write descriptions of various objects. List two variables attributed to
objects that could become part of the study of descriptive writing.


How wide is the range of variation on these two variables?


5. Select one of your own research questions. List the variables for the research
question and the range of variability you might find for each. 


ooooooooooooooooooooooooooooooooooooo 

Variable vs. Level 

In a research project, we may wish to look at levels within a variable. For example,
we might want to know how well ESL foreign students are able to do
some task. The variable is ESL student. That variable may be divided into levels
for the purposes of the study. If the study were designed to compare the performance
of ESL students who are foreign students with those who are immigrant
students, the variable would have two levels. If the study concerned geographic
area, a student might be subclassified as South American, European, Middle
Eastern, or Asian so that comparisons among these levels of ESL student could
be made. The variable would consist of four levels. Or, for the purposes of the
study, we might want to divide the variable ESL student into proficiency levels
such as advanced, intermediate, and beginner. The variable would then have
three levels.

In our definition of variables, we can limit the range of levels or expand them.
For example, in object shape, we can say that a shape either is or is not a triangle.
The range is yes/no (+ or - triangle), two levels. We can enlarge the scope to
include various shapes of triangles, in which case triangularity is no longer yes/no
but no and all the individual triangle forms, several levels. Again, we can narrow
the variable to right triangle and the range, once more, is yes/no.

If we included bilingual as a variable in a research project, we might say that
people either are or are not bilingual (yes/no levels). The matter of proficiency in
each language would not be an issue in such a study. The research question,
however, might subclassify the bilingual Ss on their proficiency in the second
language using school labels such as FEP (fluent English proficiency), LEP (limited
English proficiency), and NEP (no English proficiency), three levels of the
variable bilingual. The selection of levels, as with the identification of variables,
depends on the research question.

ooooooooooooooooooooooooooooooooooooo


Practice 2.2 

1. It is possible that age might be a variable within a research project which
wishes to relate attainment in the L2 to age of arrival in the country where that
language is used. How would you set levels within the age variable so that comparisons
among groups could be made? Justify your decision.


2. In school research, class status might be a variable. It might not be necessary
to draw comparisons for every grade level from kindergarten through grade 12.
Imagine that you wished to compare performance of children on a
Spanish/English spelling test. How many levels would you use for the class variable?
Why?


3. In Practice 2.1, you listed variables for your own research. Define the levels 
within each variable below. 



ooooooooooooooooooooooooooooooooooooo 

As mentioned in chapter 1, definitions of variables are very important in planning
research. In addition to identifying levels of a variable that are to be contrasted,
the operational definition needs to include the way the variable is measured and
the function of the variable in the research project. The function and the measurement
of variables are important because they determine exactly what kinds
of statistical tests will be appropriate in analyzing the data.


Measurement of Variables 

As we have just seen, the range along which an attribute varies can be small (e.g.,
a person either does or does not speak Arabic--yes/no) or very wide (e.g., the word
length of pieces of text could vary from one to infinity). When variables are of
the all-or-nothing, yes/no sort, we cannot measure how much of the variable to
attribute to the person, text, or object. We can only say that it is or is not present.
Variables will be quantified in different ways depending on whether we want to
know how often an attribute is present or how much of the variable to attribute
to the person, text, or object.


Nominal Scale Variables 

A nominal variable, as the name implies, names an attribute or category and
classifies the data according to presence or absence of the attribute. While you
can use a yes/no notation in recording this information, it is customary (but not
required) to assign an arbitrary number to each possibility instead. So, if the
variable is native speaker of French, a 1 might represent yes and a 2, no. The
classification numbers have no arithmetic value; they cannot be added or multiplied.
If we tally all the 1s, we will have a frequency count of how many French
speakers there are in the sample. If we tally all the 2s, we will have a frequency
count of the number of non-French speakers there are in the sample. (We are
not adding 1s or 2s but tallying the number of persons in each group.)

Nominal variables do not have to be dichotomies of yes or no. For example, the 
nominal variable native language in a study of English native speakers vs. non-native
speakers might be 1 = NS and 2 = NNS, but it also could be 1 = English,
2 = Spanish, 3 = Cantonese, 4 = Mandarin, 5 = Arabic, 6 = Italian, and so forth.
Again, the numbers are codes to represent levels of the nominal variable and have 
no arithmetic value. 
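
The coding scheme just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the codebook dictionary, the function name tally_levels, and the ten sample codes are invented for the example and do not come from any study. The point is that the codes are counted, never added or averaged.

```python
from collections import Counter

# Hypothetical codebook: the numbers are labels only, not quantities.
NATIVE_LANGUAGE = {
    1: "English", 2: "Spanish", 3: "Cantonese",
    4: "Mandarin", 5: "Arabic", 6: "Italian",
}


def tally_levels(codes, codebook):
    """Count how many Ss fall into each level of a nominal variable."""
    counts = Counter(codes)
    # Report every level in the codebook, including levels nobody falls into.
    return {label: counts.get(code, 0) for code, label in codebook.items()}


# Made-up sample: one native-language code per S.
sample_codes = [1, 2, 2, 3, 1, 5, 2, 6, 4, 1]
frequencies = tally_levels(sample_codes, NATIVE_LANGUAGE)

for label, count in frequencies.items():
    print(f"{label}: {count}")
```

The tally gives a frequency for each level of the variable. Those frequencies can be compared and added; the codes themselves cannot.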

It's easy to see that an attribute could be a yes/no variable in one research project
and a level of the variable in another. For example, if the study includes student
status as a research variable, this could be entered as 1 = yes (student), 2 = no.
In another study, student status might be entered as 1 = elementary school,
2 = secondary school, 3 = undergraduate, 4 = graduate.

The numbers assigned to represent levels of a nominal variable have no arithmetic
value. An average language of 3.5 (based on 1 = English, 2 = Spanish,
3 = Cantonese, 4 = Mandarin, 5 = Arabic, and 6 = Italian) makes no more
sense than an average status of students of 2.25 (based on 1 = elementary,
2 = preparatory, and 3 = secondary). When we begin to tally the number of people
who speak French and the number who do not, or the number of Canadian,
Mexican, or European Ss in a sample, these frequencies do have arithmetic value.

Frequency tallies are very useful in certain cases (e.g., when we want to know 
how many people participated in some study, how many were males or females, 
how many fell into the age categories of 19 and below, 20-29, 30-39, 40 and 
above, how many people enroll or do not enroll their children in bilingual education
programs). In other cases, especially when we want to compare groups,
open-ended tallies are a problem. For example, assume that you coded every
clause in sample pieces of text so that all clauses are coded as 0 or 1 (yes/no) for
relative clause, nominal clause, comparison clause, purpose clause, and so forth.
The computer would tally these data so that you would know how many of each
clause type appeared in the data. Imagine that you wanted to know whether
samples drawn from different genres showed different numbers of certain types
of clauses. However, the samples differ in the total number of clauses so a direct,
meaningful comparison cannot be made. You might change the data into proportions
so that you would show what percent of the total number of clauses in
each genre were purpose clauses and so forth. In this case you have converted
simple open-ended tallies of nominal variables into closed-group percentages.
Comparisons can then be drawn across samples from different genres. 
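
As a rough sketch of this conversion, the Python fragment below turns raw clause tallies into percentages of each sample's total. The genre names, clause counts, and the helper to_percentages are all hypothetical, chosen only to make the arithmetic visible.

```python
def to_percentages(tallies):
    """Convert raw tallies for one sample into percentages of that sample's total."""
    total = sum(tallies.values())
    if total == 0:
        # An empty sample has no meaningful percentages; report zeros.
        return {clause_type: 0.0 for clause_type in tallies}
    return {clause_type: 100.0 * n / total for clause_type, n in tallies.items()}


# Invented tallies: the second sample contains twice as many clauses overall.
genre_tallies = {
    "narrative": {"relative": 30, "nominal": 50, "purpose": 20},
    "argumentation": {"relative": 60, "nominal": 90, "purpose": 50},
}

genre_percentages = {
    genre: to_percentages(tallies) for genre, tallies in genre_tallies.items()
}

for genre, percentages in genre_percentages.items():
    print(genre, {k: round(v, 1) for k, v in percentages.items()})
```

With these invented tallies, the raw count of purpose clauses is more than twice as high in the second sample (50 vs. 20), yet the percentages (25 vs. 20) show that the difference is much smaller once the size of each sample is taken into account.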

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 2.3 

1. A research project has been proposed to evaluate the English compositions of
Ss who have taken a composition course linked to a content course with those
who took English composition separate from their content courses. What is the
subject variable in this project? How many levels of the variable are there? Why
is this classified as a nominal variable?


2. The foreign student office on campus wants to consider the distribution of
majors in the foreign student enrollment. How many levels of the variable major
would you suggest for such a project? How does the variable qualify as a nominal
variable?


3. As a research project, you hope to show that student writing changes in
quality depending on what type of writing task students are given. If you obtain
writing samples from students on four different tasks, how many levels of task
are there? Does task qualify as a nominal variable?


4. Look at the variables you have defined for your own research. Which qualify 
as nominal variables? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Ordinal Scale Variables 

Sometimes we have identified variables but have no easy way of developing a 
measurement to show how much of the variable to attribute to a person, text, or 
object. For example, if the variable is happy and this is not to be treated as a 
nominal variable in the study, we need some way of measuring degree of happiness.
There is (to our knowledge) no reliable method of measuring precisely how
much happiness one possesses at any moment. (Do those machines in fast-food
restaurant lobbies measure happiness--you know, the ones where you insert your
thumb and the little arrow moves over into the red zone?) However, we can say
that a person is very unhappy--unhappy--happy--very happy. These can be assigned
numbers from 1 to 4. In this case, the numbers do have arithmetic value.
Someone with a 4 is happier than someone with a 1. The 4, however, does not
say how much happiness to attribute to a person. Instead, it places the Ss in a
rank order with respect to each other. Persons rated 4 are ordered higher than
those with a 3, and those with a 3 higher than those with a 2, and so forth.
Ordinal measurement, then, describes a rank order measurement. The rank order
can be of two sorts. First, one could take all Ss and rank order them in relation
to each other so that with 100 Ss the rank order would be from 1 to 100. Another
possibility is to rank order Ss on a scale. Each S is placed on the scale and then
all Ss who rate 5 are higher than the group rated 4, and so forth. Each S can
be ordered in relation to others, then, in two ways: first, an absolute rank; and,
second, a ranking of persons who score at a similar point on a scale. In the first,
each individual is ranked and in the second, groups are rank ordered.

While degree of happiness is seldom measured in applied linguistics studies, we
do sometimes want to know students' attitudes towards particular lessons or sets
of materials. In our happiness variable, we only gave four points to the scale.
Most researchers who use scales prefer to use a 5-point, 7-point, or 9-point scale.
The wider range encourages respondents to show greater discrimination in their
judgments (so responses don't cluster right above the middle point on the scale).
You might like to use a Likert-type scale (after the researcher who first proposed
such scales) to tap student attitudes. To do this you would set up a set of statements
such as:

The lessons were boring 1 2 3 4 5 6 7 

and ask students to mark the strength of their agreement with each statement (7
being strongly agree, and 1 being strongly disagree).

While it is true that the scales in ordinal measurement have arithmetic value, the 
value is not precise. Rather, ordinal measurement orders responses in relation to 
each other to show strength or rank. That is, a person ranked 4 is not precisely 
twice as happy as one ranked 2. Nor is the increment from 1 to 2 necessarily 
equal to that from 2 to 3 or from 3 to 4. The points on the scales, and the numbers
used to represent those points, are not equal intervals. This points out another
reason researchers prefer to use a 5-point, 7-point, or 9-point scale. The
hope is that wider scales encourage more precision in rating and thus approach
equal intervals. For example, an S faced with a Likert 9-point scale is more likely
to think of it as an equal interval statement of rating than a 4-point scale labeled
disagree, neutral, agree, strongly agree.

ooooooooooooooooooooooooooooooooooooo


Practice 2.4

1. Your research might require that you ask teachers to give judgments of overall
student performance. What label would you assign this variable? How would
you label the points on the scale? 


2. In the absence of any precise measure, you might need to estimate the degree 
of acculturation of students. What labels would you use to define the points on 
the scale?


3. Examine the variables you have listed for your own research. How many of
the variables could be measured using an ordinal scale? How would you label the
points on the scale? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO | 

Interval Scale Variables j 

Like ordinal scale measurement, interval scale data tell us how much of the variable
to attribute to a person, text, or object. The difference is that the measurement
is much more precise. The intervals of measurement can be described.
Each interval unit has the same value so that units can be added or subtracted.

To demonstrate the difference between ordinal and interval scales, let's consider
the measurement of pauses in conversations. It is extremely tedious to time every
pause in conversation, so many researchers annotate pause length with +, + +,
and + + +. These might be coded on an ordinal scale as 1, 2, and 3. However,
if each pause is timed in seconds, then the seconds can be added to find total
pause length. With such absolute interval scaling, you can give the average pause
length in seconds while an ordinal "average" of 2.5 (ordinal + + .5) makes less
sense.
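
The contrast can be made concrete with a small Python sketch. The five pauses below are invented, and both the timings and the matching ordinal codes are assumptions made for illustration. Summarizing the ordinal codes with a median is one common choice for rank-ordered data; it is our assumption here, not a procedure prescribed in this chapter.

```python
from statistics import mean, median

# Invented data: the same five pauses measured two ways.
pause_seconds = [0.5, 1.0, 2.5, 0.5, 1.5]  # interval scale: timed in seconds
pause_codes = [1, 2, 3, 1, 2]              # ordinal scale: + = 1, + + = 2, + + + = 3

# Interval data: units are equal, so adding and averaging are meaningful.
total_pause = sum(pause_seconds)
mean_pause = mean(pause_seconds)

# Ordinal data: the codes only order the pauses, so we report the middle rank
# rather than an arithmetic average of the codes.
median_code = median(pause_codes)

print(f"total pause length: {total_pause} seconds")
print(f"average pause length: {mean_pause} seconds")
print(f"median ordinal code: {median_code}")
```

The total and the average in seconds can be interpreted directly. The ordinal codes tell us only that a pause coded 3 was longer than one coded 2, not how much longer.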

In equal-interval measurement, we expect that each interval means an equal increment.
This is true for absolute interval measurement, such as seconds, days,
and so forth. For example, in measuring age, each year adds 365 days. However,
in reality, you know that time does not equal aging, so that the difference of one
year may actually be a very large interval and the difference of another year quite
minor. In the same way, test scores are considered to be equal interval yet (unless
test items are appropriately weighted) this may not be the case. The difference
in intervals between 1 and 10 in a test of 100 items might not be the same as the
intervals between 90 and 100. In planning research, we need to think about just
how equal the intervals of the measurement are.

By now it should be clear that the way you measure variables will depend in part 
on the variable itself and its role in the research, and in part on the options 
available for precision of measurement. In the operational definition of a variable,
then, it is important that the researcher plan precisely how the variable is
to be measured. There are often very good reasons for opting for one type of
measurement over another. It is not always the case that more precise interval 
measurement is a better way of coding a variable. The measurement should be 
appropriate for the research question. For example, if we want to classify 
bilingualism according to the languages spoken, then nominal measurement is 
appropriate. 

There are times when you may not feel confident that the interval data you have
collected are truly equal-interval. Say that you had administered a test and then
later felt little confidence in the test. You might, however, feel enough confidence
that you would be able to rank students who took the test on the basis of the results.
If you aren't sure that a person who scores a 90 on a test is 8 points "better"
than a person who scores an 82, you might still be willing to rank the first student
as higher than the second. Transforming the data from interval to rank order
measurement, however, means that some information about differences is lost.
You no longer know how many points separate a person ranked 30th from a
person ranked 31st or 35th. 
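
A brief Python sketch shows what is lost in such a transformation. The four scores and the function rank_order are hypothetical. The function gives rank 1 to the highest score and lets tied scores share a rank, which is one of several possible conventions for ties.

```python
def rank_order(scores):
    """Return {S: rank}, where rank 1 is the highest score and ties share a rank."""
    rank_of_score = {}
    position = 1
    for score in sorted(set(scores.values()), reverse=True):
        rank_of_score[score] = position
        # Skip as many positions as there are Ss holding this score.
        position += sum(1 for value in scores.values() if value == score)
    return {s: rank_of_score[value] for s, value in scores.items()}


# Invented test scores for four Ss.
test_scores = {"S1": 90, "S2": 82, "S3": 81, "S4": 60}
ranks = rank_order(test_scores)

for s in test_scores:
    print(s, "score:", test_scores[s], "rank:", ranks[s])
```

S1 and S2 are 8 points apart, S2 and S3 only 1 point apart, and S3 and S4 are 21 points apart, yet each pair differs by exactly one rank. The ordering survives the conversion; the size of the differences does not.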

Whenever we convert data from one type of measurement to another, we should
carefully consider the grounds for the decision. We will discuss these issues in
much more detail later. For now, you might think about such decisions in relation
to a balance between confidence in the accuracy of measurement vs. the
potential for information loss or distortion in the conversion. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 2.5 

1. Look at variables you have identified for your own research. Which of these 
are interval scale variables? How are they measured?


► 2. In many studies in the field of applied linguistics bilingualism is a variable.
Part of the operational definition of that variable will include the values that code
the variable. Bilingualism may be coded as 1 = yes, 2 = no, or as 1 =
French/English, 2 = German/French, 3 = French/Arabic, 4 =
Cantonese/Mandarin, 5 = Spanish/Portuguese. In this case the variable has
been scaled as a(n) _______ variable. Each number represents a
_______ (level/variable). Bilingualism might be coded as 1 = very limited,
2 = limited, 3 = good, 4 = fluent, 5 = very fluent. In this case, the variable
has been measured as a(n) _______ variable. Bilingualism could be coded
on the basis of a test instrument from 1 to 100. The variable has then been
measured as a(n) _______ variable.

3. Assume that you gave a test and later decided that for the purposes of your
research you would recode the data so that Ss who scored 90+ are in one group,
those who scored 80-89 in a second group, those who scored 70-79 in a third
group, and those below 70 in a fourth group. What information would be lost in 
this transformation? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Frequency Data vs. Score Data 

Another way to think about difference in measurement is to consider whether the
study measures how much on an interval or ordinal scale or whether it measures
how often something occurs--the frequency of nominal measurement. For most
novice researchers, it is easy to identify a nominal, ordinal, or interval measurement,
but somehow the distinction becomes blurred in moving to studies
where nominal measurement is now discussed in terms of frequency, and ordinal
and interval measurement are grouped together as showing how much of a variable
to attribute to an S. This distinction is important because it will determine
the appropriate statistical analysis to use with the data.

Consider the following examples drawn from unpublished research projects of
ESL teachers. 

Lennon (1986) wanted to develop some ESL materials to help university students
learn how to express uncertainty in seminar discussions. However, he could find
no descriptions that said how native speakers carry out this specific speech function.
In his research project, then, he tape-recorded seminars and abstracted all
the uncertainty expressions uttered by teachers and students. He categorized
these expressions into five major types plus one "other" category. First, the subject
variable is status with two levels. The variable type is nominal. If he assigned
numbers to the levels of a variable, these numbers would have no
arithmetic value. Assume that a 1 was arbitrarily assigned to one level and a 2
to the other. If he added all the 1s for this variable, he would obtain the frequency
for the number of students in the study. If he added the 2s, this would be the
frequency for the number of teachers in the study.

The second variable in the research is uncertainty expressions. It, too, is a nominal
variable. If Lennon found six hedge types in the data, he could assign numbers
to identify each of the six hedge types. The numbers would represent the six
levels. The total number of instances of each would be tallied from the data.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO


Practice 2.6

► 1. To be certain that the distinction between frequency counts and
interval/ordinal scores is clear, work through each of the following examples.
Compare your responses with those of your study group partners. 

a. Brusasco (1984) wanted to compare how much information translators could 
give under two different conditions: when they could stop the tape they were 
translating by using the pause button and when they could not. There were a set 
number of information units in the taped text. What are the variables? If the 
number of information units are totaled under each condition, are these frequencies
or scores (e.g., how often or how much)?

Group consensus:_ 


b. Li (1986) wondered whether technical vocabulary proved to be a source of
difficulty for Chinese students reading science texts in English. He asked 60
Chinese students to read a text on cybernetics and underline any word they
weren't sure of. Each underlined word was then categorized as ± technical. Each
S, then, had a percent of total problem words which were technical in nature.
Among the 60 students, 20 were in engineering, 20 physics, and 20 geology. What
are the variables? Are the data frequencies or interval/ordinal scores? If Li had
compared the number (rather than the percentage) of ± technical words, would 
your answers change? 


c. The Second Language Research Forum is a conference traditionally run by 
graduate students of applied linguistics. Papers for the conference are selected 
by the graduate students. The chair of the conference wondered if M.A. and 
Ph.D. students rated abstracts for the conference in the same way. Papers were
rated on five different criteria using a 5-point scale (with 5 being high). Each 
paper, then, had the possibility of 0 to 25 points. What are the variables? Are 
the data frequencies or interval/ordinal scores? 


d. Since her ESL advanced composition students seemed to use few types of cohesive
ties, Wong (1986) wondered if they could recognize appropriate ties (other
than those they already used). She constructed a passage with multiple-choice
slots (a multiple-choice cloze test) for cohesive ties. The ties fell into four basic
types (conjunctive, lexical, substitution, ellipsis). Her students were from seven
major L1 groups. What are the variables? Do the measures yield frequencies or
scores? 



e. Hawkins (1986) taped the interactions of paired American and foreign students 
as they solved problems. She wanted to know how often Americans and foreign 
students showed they did not understand each other. What are the variables? 
Will they yield frequency or score data? 



f. Since some of the problems were easier than others, Hawkins wanted to know 
if the number of communication breakdowns related to task difficulty. She classified
five of the problems as easy and five as difficult. What are the variables?
Do the measures yield frequency or score data? 


ooooooooooooooooooooooooooooooooooooo 

The difference between nominal variables that yield frequency data and ordinal
and interval variables that yield score data is sometimes identified as noncontinuous
vs. continuous measurement (or discrete vs. continuous measurement).
Continuous data are data scored along either an ordinal or interval scale. Noncontinuous
data are not scored but rather tallied to give frequencies. Nominal
data, thus, may be referred to as noncontinuous. Discrete and categorical are 
other synonyms for nominal that you may encounter in research reports. To 
summarize: 

Frequency data show how often a variable is present in the data. The data 
are noncontinuous and describe nominal (discrete, categorical) variables.

Score data show how much of a variable is present in the data. The data 
are continuous but the intervals of the scale may be either ordinal or interval 
measurements of how much. 

ooooooooooooooooooooooooooooooooooooo 

Practice 2.7 

1. In your study group, discuss how each of the above teacher research projects
is (or is not) "interesting" in terms of the research agenda of applied linguistics.
Can the results be applied to theory construction or theory validation? Can they
be applied to practice (e.g., curriculum design, classroom methodology, materials
development)? Can they be applied to both? 


2. Select one of the questions you have defined for your own research. List the 
variables below. Identify variable type (nominal, ordinal, interval) and state how 
the data will be measured. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Functions of Variables 

To understand how the variables in a study relate to one another, we need to be
able to identify their functions. These functions grow out of the research question.
That is, the functions depend on the connection we believe exists between
the variables we have chosen to study. Sometimes we expect that two variables
are connected or related to one another although we do not view one variable as
affecting or causing a change in the other. (Correlation is an example of a statistical
procedure that tests relationships.) In other cases, we describe one major
variable and then observe how other variables affect it. (Analysis of variance is
an example of a procedure that tests effect.) It is especially important to understand
the function of variables in order to select the most appropriate way to
analyze the data. A conventional way of classifying the function of variables is
to label them as dependent or independent (and subtypes of independent variables
such as moderating or control variables).


Dependent Variable 

The dependent variable is the major variable that will be measured in the research.
For example, if you wanted to study the construct communicative competence
of a group of students, then the dependent variable is the construct and
it might be operationalized as your students' scores or ratings on some measure
of communicative competence. The measurement would be part of the operational
definition of communicative competence for your study. This variable
(communicative competence) is the dependent variable. We expect performance
on the dependent variable will be influenced by other variables. That is, it is
"dependent" in relation to other variables in the study. Let's consider two more
examples.

Assume you wanted to know how well ESL students managed to give and accept
compliments, one small part of communicative competence. Their videotaped
performances during role-play were rated on a 10-point scale (10 being high) for
giving compliments and a 10-point scale for receiving compliments and bridging
to the new topic. The student's total score could range from 0 to 20. The rating,
again, might be influenced by other variables in the research. The rating measures
the dependent variable compliment performance.

In a text analysis, you hypothesized that use of English modals may be influenced 
by types of rhetorical organization found in texts. Each place a modal occurred 
in the text it was coded. Tallies were made of overall frequency of modals and 
also of each individual modal (e.g., can, may, might, should). The dependent
variable would be modal and the levels within it might be the actual modal forms
or, perhaps, the functions of modals (e.g., obligation, advisability, and so forth).
The frequency of the dependent variable, modal, might be influenced by other 
variables included in the research. 


Independent Variable 

An independent variable is a variable that the researcher suspects may relate to 
or influence the dependent variable. In a sense, the dependent variable "depends" 
on the independent variable. For example, if you wanted to know something
about the communicative competence of your students, the dependent variable is 
the score for communicative competence. You might believe that male students 
and female students differ on this variable. You could, then, assign sex as the 
independent variable which affects the dependent variable in this study. 

In the study of compliment offers/receipts, L1 membership might influence
performance. L1, then, would be the independent variable which we believe will
influence performance on the dependent variable in this research. There might
be any number of levels in the independent variable, L1.

In text analysis, rhetorical structure may influence the use of modal auxiliaries. 
Rhetorical structure (which might have such levels as narrative, description, 
argumentation) would be the independent variable that affects the frequency of
the dependent variable. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 2.8 

1. In the study of communicative competence, what attributes of Ss (other than
sex) might serve as independent variables? Why would these be important to the
study?


2. In the study of compliment offers/receipts, what other independent variables
do you think might influence variations of performance on the dependent variable?


3. In text analysis, different genres of text might influence the frequency of use
of modal auxiliaries. What different genres would you include as levels of this
independent variable?


4. For your own research questions, list the variables and give the function (dependent,
independent) for each.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Moderator Variable 

Sometimes researchers distinguish between major independent variables and 
moderating independent variables. For example, in the study of compliments, 
you might believe that sex is the most important variable to look at in explaining 
differences in student performance. However, you might decide that length of 
residence might moderate the effect of sex on compliment offers/receipts. That
is, you might believe that women will be more successful than men in offering and 
receiving compliments in English. However, you might decide that, given more 
exposure to American culture, men would be as successful as women in the task. 
In this case, length of residence is an independent variable that functions as a 
moderator variable. 

While some researchers make a distinction between independent and moderator 
variables, others call them both independent variables since they influence 
variability in the dependent variable. However, specifying variables as 
"independent" and "moderator" helps us study how moderating variables mediate 
or moderate the relationship between the independent and dependent variables. 


Control Variable 


A control variable is a variable that is not of central concern in a particular 
research project but which might affect the outcome. It is controlled by 
neutralizing its potential effect on the dependent variable. For example, it has 
been suggested that handedness can affect the ways that Ss respond in many tasks. 
In order not to worry about this variable, you could institute a control by 
including only right-handed Ss in your study. If you were doing an experiment 
involving Spanish, you might decide to control for language similarity and not 
include any speakers of non-Romance languages in your study. Whenever you 
control a variable in this way, you must remember that you are also limiting the 
generalizability of your study. For example, if you control for handedness, the 
results cannot be generalized to everyone. If you control for language similarity, 
you cannot generalize results to speakers of all languages. 

If you think about this for a moment, you will see why different researchers 
obtain different answers to their questions depending on the control variables in 
the study. Comby (1987) gives a nice illustration of this. In her library research 
on hemispheric dominance for languages of bilinguals, she found researchers gave 
conflicting findings for the same research question. One could say that the 
research appeared inconclusive: some claimed left hemisphere dominance for both 
languages while others showed some right hemisphere involvement in either the 
first or second language. Two studies were particularly troublesome because they 
used the same task, examined adult, male, fluent bilinguals, and yet their answers 
differed. The first study showed left hemisphere dominance for each language for 
"late" bilinguals. The second showed the degree of lateralization depended on age 
of acquisition; early bilinguals demonstrated more right hemisphere involvement 
for L1 processing than late bilinguals, and late bilinguals demonstrated more 
right hemisphere involvement for L2 processing than early bilinguals. 

The review is complicated but, to make a long story short, Comby reanalyzed the 
data in the second study, instituting the control variables of the first study. 
Bilinguals now became only those of English plus Romance languages; handedness 
was now controlled (no family history of left-handedness); age was changed to one 
group, Ss between 20 and 35 years of age; and, finally, the cutoff points for 
"early" vs. "late" bilingual were changed to agree with the first study. With all 
these changes, the findings now agreed. 

It is important, then, to remember the control variables when we interpret the 
results of our research (not to generalize beyond the groups included in the 
study). The controls, in this case, allowed researchers to see patterns in the data. 
Without the controls, the patterns were not clear. At the same time, using 
controls means that we cannot generalize. Researchers who use controls often 
replicate their studies, gradually releasing the controls. This allows them to see 
which of the controls can be dropped to improve generalizability. It also allows 
them to discover which controls most influence performance. 

In the above examples, we have controlled the effect of an independent variable 
by eliminating it (and thus limiting generalizability). The control variables in 
these examples are nominal (discrete, discontinuous). For scored, continuous 
variables, it is possible statistically to control for the effect of a moderating 
variable. That is, we can adjust for preexisting differences in a variable. (This 
procedure is called ANCOVA, and the variable that is "controlled" is called a 
covariate. ANCOVA is a fairly sophisticated procedure that we will consider 
again in chapter 13.) As an example, imagine we wished to investigate how well 
male and female students from different first-language groups performed on a 
series of computer-assisted tasks. The focus in the research is the evaluation of 
the CAI lessons and the possible effect of sex and L1 group membership on task 
performance. In collecting the data, we would undoubtedly notice that not all 
students read through the materials at the same speed. If we measure this 
preexisting difference in reading speed, we can adjust the task performance scores 
taking reading speed into account. Notice that this statistical adjustment 
"controls" for preexisting differences in a variable which is not the focus of the 
research. Unlike the previous examples, the variable is not deleted; i.e., slow 
readers (or rapid readers) are not deleted from the study. While reading speed 
may be an important variable, it is not the focus of the research so, instead, its 
effect is neutralized by statistical procedures. 
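
To make the idea of statistical adjustment concrete, here is a minimal sketch of 
the kind of correction an ANCOVA performs on group means. It is an illustration 
only, not the full procedure (no significance test is carried out). The records, 
the variable names, and the function adjusted_means are all invented for this 
example: each group mean on the dependent variable is shifted by the pooled 
within-group slope times the distance of that group's covariate mean from the 
grand covariate mean. 

```python
# Hypothetical records: (sex, reading speed in words per minute, task score).
# All values are invented for illustration.
records = [
    ("female", 200, 70), ("female", 250, 78), ("female", 300, 86),
    ("male", 150, 60), ("male", 200, 68), ("male", 250, 76),
]


def mean(values):
    return sum(values) / len(values)


def adjusted_means(rows):
    """Return group means on the score, adjusted for the covariate."""
    groups = {}
    for group, speed, score in rows:
        groups.setdefault(group, []).append((speed, score))

    grand_speed = mean([speed for _, speed, _ in rows])

    # Pooled within-group slope of score on reading speed.
    sum_xy = 0.0
    sum_xx = 0.0
    for pairs in groups.values():
        mean_x = mean([x for x, _ in pairs])
        mean_y = mean([y for _, y in pairs])
        sum_xy += sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        sum_xx += sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sum_xy / sum_xx

    # Shift each group mean to where it would sit at the grand covariate mean.
    return {
        group: mean([y for _, y in pairs])
        - slope * (mean([x for x, _ in pairs]) - grand_speed)
        for group, pairs in groups.items()
    }


adjusted = adjusted_means(records)
```

In these invented data the raw group means differ by 10 points (78 vs. 68), but 
once reading speed is taken into account the adjusted means differ by about 2 
points. No S has been removed from the study; only the comparison has changed. 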


Other Intervening Independent Variables 

We often hope to draw a direct relation between independent and dependent 
variables in our research. For example, we might want to look at the relationship 
between income and education. We would expect that with additional education, 
income would increase. If you collected data, you might be surprised to find that 
the relationship is weak. Additional education might increase the income of some 
people and not help others. How can we explain the lack of a direct relationship 
between additional education → increased income? If you think about it for a 
moment, you can see that education is likely to increase the earning power of 
young people. They might earn the minimum wage at McDonald's while in high 
school and earn a much larger salary at IBM after college. So the increase in 
income is great. On the other hand, additional education is not likely to increase 
the income of older adults. Their salaries are already fairly high and the value 
of added classes may not be reflected in income. So for one age group the relation 
is that of additional education → higher income, but for the other group this is 
not the case. There is an intervening variable, a variable that was not included in 
the study, at work: 


                          young adults 
additional education  →                    →  increased income 
                          older adults 
                     (intervening variable) 


As you can guess from this diagram, an intervening variable is the same thing as 
a moderating variable. The only difference is that the intervening variable has 
not been or cannot be identified in a precise way for inclusion in the research. 

In planning research, we want to be able to identify all the important variables 
(or control for them). However, sometimes this is impossible. For example, 
intervening variables may be difficult to represent since they may reflect internal 
mental processes. When we talk about L1 → L2 transfer or L1 → L2 interference, 
we are talking about an internal mental process that we may or 
may not be able to measure accurately. Intelligence and test-taking talents may 
not be directly measurable yet play some role in changing research outcomes. If 
you review page 46, you will see that intervening variables are a source of "error" 
in our research. 

In all research, we can only account for some portion of the variability that we 
see in the major, dependent variable. We may look at the influence of many 
different independent variables to explain the variability in the dependent 
variable. However, there are many other factors which may or may not be important 
that we may fail to consider and that thus contribute to "error." Whatever the 
findings, it is important to consider whether we have defined our 
variables well so that they reflect (as well as they possibly can) all the processes 
that we hope to tap. 


ooooooooooooooooooooooooooooooooooooo 

Practice 2.9 

► 1. To be sure that you have correctly grasped the definitions of variable 
functions, please review each of the research descriptions on page 60. For each 
study, identify the function of each variable (as dependent, type of independent, 
or intervening). 

a. Lennon's study of uncertainty expressions. 


b. Brusasco's study of simultaneous interpreters. 


c. Li's study of technical vocabulary. 


d. Study of ratings of SLR1 abstracts. 


e. Wong's study of cohesive ties. 


f. Hawkins' study of communication breakdowns. 


2. Identify the functions of the variables for your own research question. In your 
study group, check to be certain that all members agree on the identification of 
these functions. 

Group consensus: _ 


ooooooooooooooooooooooooooooooooooooo 

In the examples we have given in this chapter, the distinction between dependent 
and independent variables has been straightforward. In some research projects 
it is clear that we expect certain independent variables (such as proficiency level 
or major) to influence the dependent variable (such as performance on some test). 
There are, however, times when the assignment of one variable to dependent 
status and another to the function of independent variable is not so obvious. For 
example, we might expect that Ss' scores on a Spanish language placement exam 
might be similar to their scores on a Spanish language proficiency exam. One 
variable does not really influence the other; rather, we expect to see a relationship 
between the two sets of scores. Variable 1 relates to variable 2 and vice versa. 
The relationship is bidirectional. In such cases, the assignment of dependent and 
independent status is arbitrary. 
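
One way to see why the assignment is arbitrary is to compute a correlation 
coefficient with the variables in either order. The sketch below uses invented 
placement and proficiency scores and a hand-written Pearson correlation (the 
function name pearson_r and the data are assumptions made for illustration). 
The formula treats the two sets of scores symmetrically, so swapping them 
changes nothing. 

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical scores for five Ss on the two Spanish exams.
placement = [52, 61, 70, 74, 88]
proficiency = [48, 65, 66, 80, 85]

r_forward = pearson_r(placement, proficiency)
r_reverse = pearson_r(proficiency, placement)  # same value as r_forward
```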

In this chapter we have talked about variables and the functions of variables in 
research. Now that we have identified and categorized the variables in the 
research project, we will consider how these variables can be investigated within a 
research design. 


Activities 

1. Review the study summaries presented in the activities section of chapter 1. 
For three of these studies, list the variables involved in the study and label each 
for its function in the study and according to how it was measured. 

2. P. McCandless & H. Winitz (1986. Test of pronunciation following one year 
of comprehension instruction in college German. Modern Language Journal, 70, 
4, 355-362.) compared the pronunciation of American students learning German 
under two conditions. One group of 10 students received auditory input of 
German for 240 hours (210 class hours and 30 hours of listening comprehension 
tapes). Through the use of objects, pictures, and activities, students were taught 
to comprehend German. Grammar and reading were not taught. The second 
group of 10 students had a traditional program of equivalent time in which 
grammar, reading, and speaking were emphasized. Raters listened to each 
student read lists of sentences and gave each S an accent rating from excellent 
(ausgezeichnet) to poor (mangelhaft), on a 5-point scale. List the variables, their 
functions in the study, and variable measurement. If you were carrying out this 
research project, what alternative methods might you use? 

3. V. A. Mann (1986. Distinguishing universal and language-dependent levels 
of speech perception: evidence from Japanese listeners' perception of English /l/ 
and /r/. Cognition, 24, 169-193.) shows that although native speakers of 
Japanese may be unable to identify /l/ and /r/ correctly in spoken English, they 
respond like native speakers of English to the different acoustic patterns that 
convey /l/ and /r/ (as if they are sensitive to differences in the vocal tract 
movements that convey /l/ and /r/). The study contains the following table that 
describes the Ss of the study. 


Table 1. Oral English profile: Japanese Ss participating in experiment 

            College Testing          Experience Prior to College* 
                                     (0-5 pt. scale, 5 = extensive) 

            JACET       Koike        Before   Jr. High        Sr. High 
            max = 120   max = 50     J.H.     Sch.    Home    Sch.    Home 

Superior    97.5        48.4         1.95     3.63    3.32    2.74    1.42 
Inferior    42.6        39.3         0.77     3.09    2.23    2.32    0.45 

* Ratings computed by Ss' English professor on the basis of responses to a questionnaire. 

Examine the table. Superior is "superior student" and Inferior is "inferior 
student." JACET and Koike stand for English language proficiency tests. Using the 
information in the table, write a description of the Ss' experience with English 
and their general proficiency in oral English. 

4. J. Lee (1986. Background knowledge and L2 reading. Modern Language 
Journal, 70, 4, 350-354.) replicated Carrell's research project which showed that 
"unlike native speakers...nonnative readers show virtually no significant effects 
of background knowledge." Lee replicated the study design with three 
treatments: (1) ± a title page, picture page related to the topic, (2) ± a transparent 
lexical item which clearly revealed or obscured the content of the text, (3) ± 
familiarity with the topic of the text. In Carrell's study, Ss wrote what they 
recalled in the target language; in this study they wrote in the L1. Lee found a 
significant triple interaction of the treatment variables in contrast to Carrell's 
finding of no significant effects of background knowledge. List the variables 
(with functions and possible measurement). If this were your study, what 
alternative procedures might you use? 

5. S. Shimanoff (1985. Expressing emotions in words: verbal patterns of 
interaction. Journal of Communication, 35, 3, 16-31.) investigated the speech act 
"expressive," one of the basic speech acts of communicative competence. College 
students were paid to tape their conversations for one day. All expressives (e.g., 
"I love it," "I felt fantastic," "I coulda shot him!") were pulled from these natural 
data. The expressives were categorized according to (a) expressive about self or 
other, (b) time, past or present, (c) source, person vs. object or event, and (d) 
valence, pleasant or unpleasant. Each S, then, had data on use of 
pleasant/unpleasant, present/past, speaker/other, person/object or event. S 
gender did not influence the data so this variable was dropped from the study. 
List the variables (and functions and measurement). If this were your study, 
what alternative procedures might you use? 

6. L. White (1985. The acquisition of parameterized grammars. Second 
Language Research, 1, 1, 1-17.) asked 73 adult ESL students (54 Spanish L1 and 19 
French L1) to give grammaticality judgments on the acceptability of 31 sentences. 
Thirteen of the sentences represented grammatically correct or incorrect sentences 
that exemplify the subjacency principle (a principle in Universal Grammar that 
puts restrictions on the movement of certain kinds of features, for example 
WH-words). Subjacency in English, according to Universal Grammar, constrains 
the number of "bounding nodes" the WH-word can cross. Neither French nor 
Spanish has S (S = sentence, not Subject!) as a "bounding node" while English 
does. The research then looked to see whether Ss would transfer their L1 
parameter to English or show sensitivity to the parameter setting of English in their 
grammaticality judgments. The author reports the results as inconclusive but 
believes the subject merits further research. What are the variables (functions 
and measurement)? If this were your study, what alternative procedures might 
you use? 


Chapter 3 

Constructing Research Designs 


• Determining the design 
    Distinguishing the dependent and independent variables 
    Confounded designs 
    Repeated-measures vs. Between-groups (randomized) designs 
    Mixed designs (split-plot designs) 
• Practice in drawing design boxes 
• Classification of designs 
    Studies with intact groups 
        One-shot design 
        One-group pretest-posttest design 
        Intact group, single control 
        Time series designs 
    Experimental studies 
        Random assignment posttest 
        Control group pretest-posttest 
    Ex post facto designs 


Determining the Design 

Chapter 2 presented all of the concepts that are needed for determining 
design. However, we will review them here as part of research design. 
In addition to determining the scope of the research question, stating the research 
question as clearly as possible, giving operational definitions to key terms, 
identifying variables, understanding the roles variables will play in the research and 
how those variables will be observed, we need to plan the overall design of the 
project. This is important, for it will help us to determine how the data will be 
analyzed. 


Distinguishing the Dependent and Independent Variables in Research 

When novice researchers approach a consultant for help in analyzing the data, 
the first question the consultant asks is "Tell me about your research. What are 
you trying to find out?" If the research question is clearly stated, the consultant 
will probably say, "So X is your dependent variable and A, B, and C are your 
independent variables, right?" If the research question is not clearly stated, it 
may take time (which may or may not have to be paid for) to determine this first 
crucial piece of information: which variables in the research are which. 

In chapter 2, we practiced identifying the function of variables. As a review, we 
will identify the dependent and independent variables in the following brief 
description of a research project. 

The university is concerned because, while many immigrant and 
foreign students enroll as science majors, few select a major in the 
humanities. As part of the admissions process, prospective students 
must score above 1200 on the SAT (Scholastic Aptitude Test). We 
believe that most immigrant and foreign students who meet this 
cut-off point do so by scoring very high on the math portion of the 
test (much higher than monolingual American students) but they 
probably score relatively low on the verbal portion of the test. We 
want to look at the math scores of American and immigrant/foreign 
students. We are also interested in comparing the math scores of 
students declaring an interest in the humanities with those planning 
to enroll in the sciences. (We may repeat this process for the SAT 
verbal scores or perhaps for the ratio of math to verbal scores of the 
Ss in the two groups. Ultimately the information might be useful 
in the admissions process so that immigrant/foreign students 
interested in the humanities have a better chance of acceptance.) 

Null hypothesis: There is no difference in the SAT math scores of native and 
nonnative speakers. 

Null hypothesis: There is no difference in the SAT math scores of humanities 
and science students. 

Null hypothesis: There is no interaction between student language 
(native/nonnative) and major area (humanities/science) on the math scores. 

Dependent variable(s): SAT math scores 

Independent variable(s): 

(1) S type: 2 levels, native vs. nonnative speaker 

(2) Major: 2 levels, humanities vs. science 

Try to visualize the data of this project as filling a box. 

What fills the box? The data on math scores. Now visualize building 
compartments or cells in that box. The contents of the box in this project are 
subdivided first in two sections. 


NSE NNSE 

The contents of the box have been subdivided so that the scores for native 
speakers are placed in one section and the scores of nonnative speakers in the 
other. 

Now visualize the final division of the contents of the box: 


Humanities 


Science 


NSE NNSE 

The data for the dependent variable have now been divided into four sections. 
In the upper left corner are the data of the native speaker humanities 
students. In the upper right corner are the data of the nonnative speakers who are 
humanities majors. What data are in the lower left-hand corner? The scores of 
the science majors who are native speakers. In the lower right section are the 
scores of nonnative speakers who are science majors. 

With the data partitioned, we can compare the two top sections with the two 
lower sections to find the differences between humanities and science majors. 

We can also look at native speakers vs. nonnative speakers by comparing the data 
in the left two sections with the right two. We can also compare all cells of the 
box individually to see which differ from the others. 
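
The partitioning just described can be sketched in a few lines of code. The 
scores below are invented, and the helper names cell_means and marginal_means 
are assumptions made for this illustration; the point is only that one set of 
data on the dependent variable can be grouped by cell, by row (major), or by 
column (S type). 

```python
from collections import defaultdict
from statistics import mean

# Hypothetical SAT math scores. Each S contributes exactly one row:
# (S type, major, score).
scores = [
    ("NSE", "humanities", 560), ("NSE", "humanities", 600),
    ("NNSE", "humanities", 640), ("NNSE", "humanities", 680),
    ("NSE", "science", 650), ("NSE", "science", 690),
    ("NNSE", "science", 720), ("NNSE", "science", 760),
]


def cell_means(rows):
    """Mean score in each of the four cells of the design box."""
    cells = defaultdict(list)
    for s_type, major, score in rows:
        cells[(s_type, major)].append(score)
    return {cell: mean(values) for cell, values in cells.items()}


def marginal_means(rows, position):
    """Mean score for each level of one independent variable.

    position 0 groups by S type (left vs. right sections of the box);
    position 1 groups by major (top vs. bottom sections).
    """
    levels = defaultdict(list)
    for row in rows:
        levels[row[position]].append(row[2])
    return {level: mean(values) for level, values in levels.items()}


cells = cell_means(scores)
by_s_type = marginal_means(scores, 0)
by_major = marginal_means(scores, 1)
```

With these invented scores, the comparison of the top and bottom sections gives 
620 vs. 705, and the comparison of the left and right sections gives 625 vs. 700. 
Whether such differences are meaningful is a question for the statistical 
procedures discussed later. 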

Confounded Designs 


Visualizing the design as placing data in boxes will help us avoid confounded 
designs. Confounded designs should be avoided if at all possible. These are 
designs where it is impossible to separate the effects of the independent variables. 
For example, suppose you wanted to know whether students in applied linguistics 
would learn the principles of different teaching methodologies better if 
they not only read descriptions of the methods but also saw videotaped 
demonstrations of classes where the methods are used. Imagine you were also 
interested in knowing whether experienced ESL/EFL teachers would understand 
the principles better than novice teachers. You assign the treatments (read + 
video demo or read only) to two different classes. The design box would look like 
this: 


Read Read + View 

The teachers have been assigned to the classes on the basis of their experience. 
One class is for experienced teachers and one for novices. 


(novice) (exper.) 

Read Read + View 

What happened? Once the data are collected, you could try to compare the two 
treatments. But any differences you found might be due to the teacher-experience 
variable. If you tried to compare the experienced and inexperienced teachers, any 
differences might be due to the difference in treatments. With this design, it is 
impossible to separate teacher-experience effect from treatment effect. 

Hopefully, you see that you cannot attribute any difference that you find in the 
degree of understanding of the methodologies to either treatment or experience. 
These two independent variables are confounded. (Confound those confounded 
variables!) Sometimes, such a design is unavoidable but, in reporting the results, 
you would definitely have to ask for further studies to discover just how the two 
independent variables are related to differences in performance on the dependent 
variable. 
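
A quick check for this kind of confounding is to list which levels of one 
independent variable actually occur together with which levels of the other. 
The sketch below is an illustration under stated assumptions: the function name 
is_confounded and the class rosters are invented, and the check only detects the 
extreme case described above, where every level of one variable occurs with just 
one level of the other. 

```python
def is_confounded(assignments):
    """Return True if two independent variables cannot be separated.

    assignments is a list of (treatment, experience) pairs, one per S.
    The variables are treated as fully confounded when each treatment
    occurs with only one experience level and vice versa.
    """
    experience_by_treatment = {}
    treatment_by_experience = {}
    for treatment, experience in assignments:
        experience_by_treatment.setdefault(treatment, set()).add(experience)
        treatment_by_experience.setdefault(experience, set()).add(treatment)
    return (
        all(len(levels) == 1 for levels in experience_by_treatment.values())
        and all(len(levels) == 1 for levels in treatment_by_experience.values())
    )


# The design in the text: novices only read, experienced teachers read and view.
confounded_design = (
    [("read", "novice")] * 3 + [("read + view", "experienced")] * 3
)

# A crossed alternative: both kinds of teachers receive both treatments.
crossed_design = [
    ("read", "novice"), ("read", "experienced"),
    ("read + view", "novice"), ("read + view", "experienced"),
]

confounded_result = is_confounded(confounded_design)  # True
crossed_result = is_confounded(crossed_design)        # False
```

In the crossed design all four cells of the box contain data, so the treatment 
comparison and the experience comparison can each be made separately. 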


Repeated-Measures vs. Between-Groups (Randomized) Designs 

In the SAT example, you will notice that the math score of each S fell into one 
and ONLY one cell of the box. The data of a NSE who is a humanities major can 
only appear in the upper left section of the box. That of a nonnative humanities 
major can only appear in the upper right section. The comparisons will be drawn 
between independent groups in each of the four cells of the box. Therefore, this 
is called a between-groups (independent or randomized) design. This pattern is 
not always followed. Consider the following example: 

You have been given a Fulbright grant to teach American literature at the 
University of Helwan. Since you have never taught American literature 
to nonnative speakers of English, you are not sure just how appropriate 
your selection of short stories might be. Five different themes are 
presented in the stories. Since you will teach the course again in the second 
term and hope to offer the course during the summer at Alexandria 
University, you want to be sure of your choices. You ask each student to rate 
the stories on a number of factors. Each S's ratings of the stories within 
a theme is totaled as that S's score for the theme. 

The ratings fill the box which is then divided into sections for ratings of the five 
themes: 


T1      T2      T3      T4      T5 


The rating of each S no longer falls into one and ONLY one cell of the box. Each 
S gives a rating for the theme in each cell of the box. The themes will be 
compared using data collected from the same Ss. Since repeated readings are taken 
from the same Ss, this is called a repeated-measures design. This is not something 
to avoid. Rather, the distinction between independent groups and 
repeated-measures designs will determine in part the choice of an appropriate 
statistical procedure for analyzing the data. 
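
The distinction can be stated as a simple rule about where each S's data fall in 
the box, and the rule can be sketched in code. The function classify_design and 
the small data sets are hypothetical, invented only to illustrate the rule: if 
every S appears in exactly one cell, the design is between-groups; if every S 
appears in every cell, it is repeated-measures. 

```python
def classify_design(observations):
    """Classify a design from a list of (S identifier, cell label) pairs."""
    cells_by_subject = {}
    for subject, cell in observations:
        cells_by_subject.setdefault(subject, set()).add(cell)
    all_cells = {cell for _, cell in observations}

    if all(len(cells) == 1 for cells in cells_by_subject.values()):
        return "between-groups"
    if all(cells == all_cells for cells in cells_by_subject.values()):
        return "repeated-measures"
    return "neither purely between-groups nor purely repeated-measures"


# SAT example: each S's score falls into one and only one cell.
sat_study = [
    ("S1", "NSE/humanities"), ("S2", "NNSE/humanities"),
    ("S3", "NSE/science"), ("S4", "NNSE/science"),
]

# Short-story example: each S rates all five themes.
theme_study = [
    (subject, theme)
    for subject in ("S1", "S2", "S3")
    for theme in ("T1", "T2", "T3", "T4", "T5")
]

sat_design = classify_design(sat_study)
theme_design = classify_design(theme_study)
```

The third outcome is left deliberately vague in this sketch; a design in which 
Ss appear in some cells but not all needs a closer look at which comparisons 
are drawn between groups and which within the same Ss. 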


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 3.1 

► 1. Decide whether the following studies involve a repeated-measures design or 
a between-groups design, or show a design which uses both. 

a. Your university is considering removing credit from ESL courses for 
immigrant students though they will continue credit for foreign students. The claim 
is that the courses are remedial and give students credit for course work covered 
in high school. You survey faculty on whether or not they believe credit should 
be allowed both immigrant and foreign students. In your report you compare the 
responses of faculty who regularly have students from these groups in their 
classes with those who do not. 

Repeated-measures, between-groups, or both?_ 

Rationale: _ 




b. In a cross-cultural analysis class, you ask students to watch a video and judge 
the appropriateness of small-talk behavior in each of five different episodes. The 
students are U.S., Canadian, Indian, and Vietnamese. 

Repeated-measures, between-groups, or both?_ 

Rationale: __ 


c. You believe that purpose clauses (e.g., "To brown, remove the foil, and 
continue baking for 5 minutes.") usually precede rather than follow main clauses. 
To test this claim, you search a random sample of oral data (from the Carterette 
& Jones corpus, 1974, or the White House Transcripts, 1974) and a random 
sample of written data (from the Brown corpus, 1979, or your own collection). 

Repeated-measures, between-groups, or both?_ 

Rationale: _ 


d. You own an ESL school as a business. You have contacted local businesses, 
offering a course in "communication improvement" for their immigrant 
employees. You screen each student and prepare an individual profile on pronunciation, 
informal conversational skills, and oral presentation skills. At two-week intervals 
you reassess the students' abilities in each of these areas. At the end of the 
course, each student receives his/her record of improvement. Employers, who 
paid for the course, receive a report on course effectiveness for the students as a 
group. The report compares the scores of the groups at each point throughout the 
course. 

Repeated-measures, between-groups, or both?_ 

Rationale: _ 


e. Using a 30-item questionnaire, you ask native speakers of English and 
nonnative students to decide whether a single verb or a two-word verb would be more 
appropriate in a given context. Each question contains a context and then a 
choice between a single verb (such as telephoned) and a two-word verb (such as 
called up). Fifteen items appear in a formal context and 15 in an informal 
context. Does context type influence verb type choice? 

Repeated-measures, between-groups, or both?_ 

Rationale: _ 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 
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Mixed Designs (Split-Plot Designs) 

We've noted that in between-groups comparisons, independent groups are 
compared. For example, if we ask administrators of EFL programs, teachers in EFL 
programs, and students in EFL programs to complete a questionnaire reporting 
perceived value of courses as preparation for study in U.S., Canadian, or British 
universities, then the comparisons we draw are between different groups. On the 
other hand, if we survey all the students in modern languages regarding the value 
they placed on language vs. literature courses in their university training, then 
we have only one group of students and we are drawing comparisons for the 
courses by repeated measures of that one group. 

Sometimes, however, designs include both comparisons of independent groups 
and repeated-measures of the same group. These are called mixed designs or 
split-plot designs. For example, imagine that you did carry out an attitude survey 
regarding the perceived value of a language skills program in preparation for 
university training in British, Canadian, or U.S. institutions. This will be one of 
several measures that you plan to use for the purposes of program evaluation. 

You are a "foreign expert" working at three different institutes in 
China. Prior to final selection for study overseas, Chinese students 
are given an opportunity to improve their language skills in one of 
these institutes. In each program, Chinese professors and 
Canadian, U.S., and British EFL teachers offer a variety of courses 
to the students. You administer the questionnaire to students at 
each institute at the start of the program. You want to compare the 
responses of these three separate groups. The questionnaire 
responses will show how valuable students in each group expect the 
program to be. 

As a review, the H0 for this part of the study is that there is no difference in 
"expected value" ratings of the program across the three institutes. The dependent 
variable is the "expected value" rating. The independent variable is institute (with 
three levels). The design box for the study is: 


Ratings 


Inst. 1 Inst. 2 Inst. 3 

While the word Ratings is outside the box diagram, the ratings are, in fact, inside 
the three cells of the box. The ratings for institute 1 fill the first cell, the ratings 
for institute 2, the second cell, and the ratings for institute 3 are in the third cell. 
So far, the design is between independent groups. 

At the conclusion of the program, you administer the questionnaire 
again. You want to compare the responses on this questionnaire 
with those given before the start of the program to see whether Ss' 
value perceptions regarding the program have changed. 



The questionnaire is completed by the same Ss. In addition to comparing the 
responses across the institutes (between independent groups), we also want to 
compare the responses of each S at time 1 with time 2 (a repeated-measures 
design). 

The amended design looks like this: 


Inst. 1_ 

Inst. 2_ 

Inst. 3_ 

Ratings      Time 1     Time 2 

Again, the word Ratings outside the lower left-hand corner of the box tells us 
what is placed in each cell of the box. (That is, the top two cells of the box 
contain the ratings for institute 1 at time 1 and then at time 2, and so forth.) 

Six months after the same students have enrolled in university programs 
in Canadian, U.S., or British universities, they are sent the third 
questionnaire form to fill out. 

To compare the students' perceptions of value of the program, the questionnaire 
was administered three times. The comparison uses a repeated-measures design. 

To compare students from the different institutes, the responses are put into three 
groups. The comparison, here, is between independent groups. 

Here is the final design box for the study. 


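         +---------+---------+---------+
Inst. 1  |         |         |         |
         +---------+---------+---------+
Inst. 2  |         |         |         |
         +---------+---------+---------+
Inst. 3  |         |         |         |
         +---------+---------+---------+
           Time 1    Time 2    Time 3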


In this design, the same Ss participate in both a repeated-measures and an
independent groups comparison. Fairly complex designs of this sort are common in
applied linguistics research. The type of statistical analysis of the data will differ
according to whether the comparisons are between-groups or repeated-measures.
You can see why this is the case if you think for a moment about how close your
responses on two measures might be. If you are asked twice to give your
perception of the value of grammar instruction, it is likely that your two responses
will be more similar than would one of your responses and my response.
Whenever you compare one person on different measures (or the same measure at
different times), the performances are more likely to be similar than the
performances between two people on different measures. (If by mistake you used
a statistical test for between-groups with a repeated-measures design, the test
could very well say that there was no difference between the groups when in fact
a difference really exists. This gives rise to a Type II error, a concept we will
explain later.)
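
The point about choosing the wrong test can be made concrete with a small sketch. The ratings below are invented for illustration, and the two helper functions are hypothetical names, not procedures introduced in this chapter; they compute a pooled-variance t statistic (which treats the two sets of scores as independent groups) and a paired t statistic (which works on each S's own change from time 1 to time 2).

```python
import math
import statistics


def independent_t(a, b):
    """Pooled-variance t statistic, treating a and b as independent groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(b) - statistics.mean(a)) / se


def paired_t(a, b):
    """t statistic for repeated measures, based on each S's own difference."""
    diffs = [y - x for x, y in zip(a, b)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


# Invented ratings for the same five Ss at time 1 and time 2.
time1 = [3, 5, 4, 6, 2]
time2 = [4, 6, 5, 7, 4]

t_between = independent_t(time1, time2)
t_repeated = paired_t(time1, time2)
print(round(t_between, 2), round(t_repeated, 2))
```

With these invented numbers the between-groups statistic is small (about 1.3) while the repeated-measures statistic is large (about 6.0), because every S moved in the same direction. Treating the data as independent groups hides that consistency, which is how a real difference can be missed.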


Practice in Drawing Design Boxes 

One of the best ways to clarify a research design is to draw a box diagram (as 
we began to do in the previous discussion). The first thing to remember is that 
what goes inside the box is all the data on the dependent variable. (It is possible 
that you might have more than one dependent variable, but we will not cover 
such instances here.) The box will then be partitioned so that parts of the total 
data can be compared with other parts. When there is one dependent and one 
independent variable, by convention the levels of the dependent variable (what 
is being measured) will appear above each other on the left side of the box. The 
labels for the levels of the independent variable will appear across the cells at the
bottom of the box.

           +----------------------+
DEPENDENT  |                      |
           +----------------------+
                 INDEPENDENT


In the purpose clause study above (see page 78), let's assume that the oral sample
is from the tapes described and the written sample includes 200 words from an
engineering lab manual, 200 words from Always Coming Home (a novel), 200
words from an economics textbook, 200 words from a geography book, and 200
words from a popular health science book.

The research question for this study in null hypothesis form is: 

There is no difference in distribution of purpose clauses (either before or 
following the main clause). 

There is no difference in the distribution of purpose clauses in written vs. 
oral language. 

There is no difference in the distribution of purpose clauses between the 
five samples of written text. 

The dependent variable is position of the purpose clause (2 levels). The
independent variable is language mode with two levels (oral vs. written).

Visualize the data as filling a box. The data inside the total box are all of the
purpose clauses in the data. Now, look at the first null hypothesis. We make two
divisions in the data, subdividing the purpose clauses into those in initial position
and those in final position. We turn to the next null hypothesis and add two
more divisions to the box. The data for initial position are divided in two and
all the initial clauses from the spoken data are put in one cell and all those from
written data in the other. Then the clauses in final position are divided and all
final clauses from the spoken data go in one cell and those of written data in
another. We have four cells. This leaves us with the final null hypothesis. How
can we represent this subdivision? If you're not sure how to do this, read on.

If you look at the box diagram for this study, you will see that the division of the 
data has been quite regular up to the point where the written text was subdivided 
into sample types. 


               +----------+----------+
Pre-main cl.   |          |          |
               +----------+----------+
Post-main cl.  |          |          |
               +----------+----------+
                   Oral      Written

In the right two sections (the written data), we can draw vertical lines to
subdivide each section into five parts for the five selections. These subdivisions
represent the five levels of the written mode. This obviously is not a neatly balanced
study. However, again it is typical of much research in our field.
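
Filling the cells of a design box amounts to cross-tabulating the data. The sketch below shows one way to do that for the four-cell box (position by mode). The coded clauses are invented for illustration and are not data from the purpose clause study.

```python
from collections import Counter

# Invented coding: one (position, mode) pair per purpose clause found.
clauses = [
    ("pre-main", "oral"), ("post-main", "oral"), ("post-main", "oral"),
    ("pre-main", "written"), ("post-main", "written"),
    ("post-main", "written"), ("post-main", "written"),
]

# Each distinct (position, mode) pair is one cell of the box.
cells = Counter(clauses)

positions = ["pre-main", "post-main"]   # levels of the dependent variable
modes = ["oral", "written"]             # levels of the independent variable

# Print the box: positions down the left side, modes across the columns.
print(f"{'':12}" + "".join(f"{m:>10}" for m in modes))
for p in positions:
    print(f"{p:12}" + "".join(f"{cells[(p, m)]:>10}" for m in modes))
```

Subdividing the written column into the five text samples would simply mean coding a third value (the sample) for each written clause and counting on all three.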

Let's try another example. Look at the following boxed diagram. 


            +-----+-----+-----+-----+-----+
Similarity  |     |     |     |     |     |
Rating      |     |     |     |     |     |
            +-----+-----+-----+-----+-----+
              Can   US    Br    NZ   Aus

Assume you have asked second language learners to listen to English speakers
representing each of the above groups (Canadian, U.S., British, New Zealand,
and Australian). They are asked to judge how similar they feel their own
pronunciation is to each of these groups on, say, a 7-point scale. Think of the data
as the total collection of responses of all Ss. Inside the complete box we find the
similarity ratings. In the cell labeled Can, we find the ratings of similarity with
Canadian English. If you wanted to look at how closely the students felt their
own pronunciation approached that of the speakers representing New Zealand
English, you would look at the cell labeled NZ.

Now, assume that the gender of students doing the judging might make a
difference in their ratings. The box diagram would change to show this new division.


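      +----------+----------+
Can   |          |          |
      +----------+----------+
US    |          |          |
      +----------+----------+
Br    |          |          |
      +----------+----------+
NZ    |          |          |
      +----------+----------+
Aus   |          |          |
      +----------+----------+
          Male      Female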

ooooooooooooooooooooooooooooooooooooo 

Practice 3.2

1. Draw box diagrams for each of the studies on page 77, labeling the divisions.
a. Faculty survey re credit for ESL classes.


b. U.S., Canadian, Indian, and Vietnamese judgments of appropriateness of
small-talk behavior.


If you wanted to be sure that the judgments did not change across the episodes, 
how would you diagram the design? 

c. Purpose clause positioning.


d. Communication class improvement. 


e. To do the study of two-word verbs requires adding a third variable. This 
produces a three-dimensional box (a cube). Mark and identify each division of 
the box. 


2. Compare your diagrams with those of others in your study group. Are they 
the same? If not, how do you account for the differences? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Classification of Designs 

When you classify variables and their functions, you prepare the way for selecting 
appropriate statistical procedures for describing and interpreting your data. This 
classification of variables and their roles in your research is crucial. In addition, 
the classification of designs helps the novice researcher to consider threats to
research validity.

Campbell and Stanley (1963) list 16 different design types and discuss the
possible threats to reliability and validity inherent in each. There is no reason why
you should know the names of all the different types of designs or the ways in
which they are grouped together. The names, however, should be familiar so that
when you read a report about an "ex post facto" design, you won't throw up your
hands in despair at "technical jargon." You will know that it is a design
classification and that you can always look it up in an index to refresh your memory
if you decide that it is important to you. What is important is that you think
about which type of design will work best for your particular research project.

In selecting an appropriate statistical procedure, one of the first things to decide 
will be whether or not there is random selection and assignment of Ss (or of texts, 
or of objects) in the research. Designs, too, can be divided into those which do 
and those which do not show random selection and random assignment. 

We have already said that research gives us support for our hypotheses rather
than proof that we are right. However, we often do hope to establish causal links
between variables. To make causal claims, the research must be carefully
planned and closely controlled. For example, true experimental designs require
random selection and, where treatments are compared, random assignment to
groups. Sometimes, especially in classroom research, neither random selection
nor random assignment is possible. The researcher must work with an
established class of students, an intact group. Students are assigned to a class on the
basis of scores on tests, on the basis of their compositions or oral interview skills,
or by self-selection (the students decide which courses to take). Sometimes
teachers select students in order to have a homogeneous class where they don't
have to worry about extreme variations in student performance. Other teachers
select students to have enough variation to allow them to use "cooperative
learning techniques" (where all students become the experts in some piece of every
task). When random selection is not possible, causal claims are also impossible.


Studies with Intact Groups 

The majority of classroom research involves the use of classes where students
have already been assigned on the basis of some principle. This is called an intact
group. In this research it is impossible randomly to select students to begin with.
Even where students could be placed in one of several sections of the same level,
the assignment to those sections is seldom random. Students may self-select a
section according to their timetables, and their timetables may reflect their
majors. For instance, undergraduate engineering students may have a set number
of required courses, courses which they must take, and so all ESL students from
the engineering department might be in one section of a class. In intact group
studies, we are unable randomly to select or randomly to assign students for
research purposes. When random selection is required (i.e., when drawing
inferences or generalizing to a population), the researcher should consider whether the
design prohibits or allows for the use of inferential statistical procedures.

It is much easier to achieve random selection when dealing with text analysis.
However, even here, random selection may not always be possible. Sometimes
random selection is not desired and causal claims are not at issue. For example,
if you are analyzing the use of special linguistic features in the poetry of a single
author, you may need to use all of the data rather than a random selection of
poems of the author. In carrying out an in-depth study of one fourth-grade class,
the class behaviors of individual students may be followed and described in great
depth. The descriptions are of this particular author or these particular students.
Random selection is not desired and there is no intention of making causal claims
or of generalizing the findings to all poets or fourth-grade students everywhere.

In classroom research where researchers wish to see the effects of a
teaching/learning treatment, the design often uses the intact group. While such
designs will not allow us to make causal (cause-effect) statements about the
findings, they will allow us to give evidence in support of links between variables
for these particular classes.

Intact designs are often the only practical way of carrying out research which will
help find answers to questions. However, as you will see below, the researcher
must think about how much confidence to place in the findings and interpret
those findings with care. Replication will be necessary. As we will see in a
moment, there are ways to improve designs to give us more confidence in findings
that show a link between variables (although no causal claim will be made
regarding the link).


One-Shot Design 

In many teaching programs, teachers (and administrators) want to know whether
students meet the objectives set for the course. At the end of the course (and
whatever happened during the course is an example of "treatment" in research
jargon), students take a test. The schematic representation for this design is:

T - X

where T stands for treatment and X for the test results.

This is a very simple design but the results must be interpreted with great care.
While you may be tempted to say that the treatment "worked" and want to share
the results with other teachers at the next TESOL conference, you should be wary
of doing so. The results may not be valid and you cannot generalize from them
with confidence. As we have said many times, the research process is meant to
give you confidence in your description of results and (if you wish to go beyond
description) the generalizability of your findings. The study has given you data
that you can describe, but in interpreting the data you must caution listeners and
readers to consider the possibility of threats to validity and reliability of the
research. (You can quickly review these in chapter 1 if you don't remember them
now.)

To try to take care of some of these problems, researchers sometimes use
standardized tests with published norms. When possible, of course, this is a good idea.
However, Campbell and Stanley warn that we still can't say much about the
results even if the T has been carefully described. That is, one might attribute any
differences to many factors other than the treatment. (And, yes, that is "error"
as far as you are concerned because you are not interested in these other factors.)


One-Group Pretest-Posttest Design 

If you gave the learners in the above example a pretest on the first day of class
and a posttest at the end of the course, you would have changed the design to the
following schematic:

X1 - T - X2

By giving the pretest, you can assure yourself that students did not already know
the material tested on the posttest. The administration of the pretest could be a
threat to the validity of the research, especially if the time between pretest and
posttest is short, the items are very similar, or the pretest gave Ss pointers on what
to study.

The pretest and posttest do not have to be "tests." They can be observations.
For example, you might observe the behavior of Ss using some observational
checklist before and after some special instruction. Test is just a cover term for
all kinds of observations.

Imagine that your school has a very poor record of parent involvement. Very few
parents visit the school. You have a record of the number of visits by parents for
the first three months of school. Some teachers believe that the parents are just
not interested in their children's education. You don't agree. You think that the
school does not appear to be a very "welcoming" place, particularly to parents of
bilingual children. You have already been videotaping some of the classrooms
and the tapes show children engaged in many different activities. You ask the
children if they would like to have their parents see them on television on parents'
night. They are all excited about this possibility. You and the students design
special invitations to the parents. You ask the bilingual children to add a note
in their first language to make the invitations even more special. The principal
of the school agrees to collect some of the best compositions from
various classes and put them on display. And, of course, some of these
compositions are written in the home language of the children. Signs are posted not only
in English but in the home language as well. Since none of the office staff speaks
this language, several of the children are invited to come and serve as special
interpreters to welcome the parents. Parents' night is a great success. Not only
that, but some parents accept invitations to visit classes and do appear in the next
two weeks.

The pretest measure is the number of visits by parents before parents' night. The
posttest is the number after parents' night. It is difficult to say precisely what the
treatment may have been. For the sake of the illustration, we will say it was
parents' night (not all the preliminary activity). The pretest-posttest design for
one group has many drawbacks that you can consider in your study group.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 3.3

1. In the parents' night example above, do you believe that the pretest measure
in any way influenced the posttest measure?


2. If the treatment is activities during parents' night alone, are there activities 
between the pretest and the treatment that might affect the outcome? 


If the treatment is parents' night alone, what history factors (rather than the 
treatment) might be related to improved parental involvement? 


3. Having carried out the research, with whom would you want to share your
findings?


If you shared the findings with teachers at a parallel school in your district, could
you generalize your findings to their district? Why (not)?


4. You know that students at a nearby college can get stipends if they serve as
interns in work programs. Some of these students are bilingual. How could you
use the above information as evidence to secure an intern to work at the school
to promote parent involvement?_


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Intact Group—Single Control 

If you are working with a class and there are students in other sections of that
same course, it is possible to establish a control group for the research. Ss have
not been randomly selected for the course nor randomly assigned to sections of
the course. However, you could randomly assign the special treatment to the
sections by the flip of a coin.

The representation for this design is as follows (G1 stands for the experimental
group, the group which will receive your special treatment, and G2 stands for the
control group):

G1 (intact) - T - X

G2 (intact) - O - X

This design is an improvement over the previous design. The above schematic for
this design suggests that only one treatment and one control are used in studies of
this sort. It is quite possible that several treatments might be compared with
single or multiple control groups. The distinguishing features of the design are
that the groups compared are intact groups (not randomly selected and randomly
assigned); there is a pretest and posttest measure; and there is a control group.

Since this design is often used in our field for program or materials evaluation,
let's review some of the other possible threats to validity and reliability of such a
design. Imagine that you wished to evaluate the effectiveness of immersion
programs. In these programs Ss receive education in an L2 which is, typically, a
minority language: for example, English-speaking children receive their early
education in Spanish in Los Angeles, or in German in Milwaukee, or in Arabic in
Cleveland. To show that such programs do not harm the children in terms of
further first language development, the children might be compared with an
appropriate control group on their scores on English language subject matter tests.

In this case, the treatment and control are definitely different. We would,
however, need to know precisely how they differ in terms of English use. That is, if
activities in the treatment and control overlap, we need to know the extent of the
overlap. If the overlap is great, then the outcomes would be different than if the
overlap were slight.

It is also important to know that Ss in these classes are (usually) volunteer
students. They are, thus, not typical of all elementary school children. This could
affect the results and generally limit generalizability. It would be important to
check for differential dropout rates in the experimental and control groups. It is
likely that Ss who do not do well in the experimental program will de-select
themselves and, perhaps, appear then in the control group. If they were generally
the weaker students to begin with, and they shift groups, the results would be
differentially affected. If some of the tests were observations conducted by
trained observers, it would be important to know whether the observations were
"blind" (i.e., the observer did not know which groups were observed, experimental
or control, or the purpose of the observation). The training of the observers
could influence the outcome (the observations noted might be differentially
affected).

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO


Practice 3.4 

1. Assume you wish to analyze the use of metaphor in the short stories of your
favorite writer. Because you don't have access to an optical character reader, you
have entered only one short story into the computer and coded all the metaphors.
You want to report on the frequency and types of metaphor in the text. Identify
the design classification and give the schematic representation.


2. Your library research shows that, because this author was so successful, this
use of metaphor became common in short stories between 1935 and 1940.
Fortunately for you, you have been able to meet surviving members of the writer's
family, and they tell you that the Huntington Library has a number of unfinished
and unpublished manuscripts which were written while the author lived in
Malaysia (1936-40). These manuscripts include several articles written on the
birds of Malaysia. Once you read these articles and the unpublished stories, you
believe the author's use of metaphor in short stories changed (you hope to
establish a "bird period" as part of your dissertation research!). What is the best design
for the research? Why? Identify the design and give the schematic
representation.


3. Imagine that you wanted to evaluate the effectiveness of immersion programs
as compared with foreign language programs. You have located schools where
similar children receive instruction in Spanish, German, and Arabic. FLES
(foreign language in elementary school) offers direct language instruction (rather
than expecting children to acquire the language via instruction in content
courses). In your study group, discuss the possible threats to the validity and
reliability of such a comparison. How many and what types of control groups
would you use in such an evaluation? In your discussions, consider the
treatments being compared. Exactly in what ways do you imagine they differ? What
factors (other than language immersion vs. language instruction) might influence
differences between the groups?

Group report:_ 

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Time-Series Designs 

Because of the problems involved with random assignment and the difficulties in
finding control groups that match the experimental group in many ways,
researchers often turn to time-series designs. While the Ss selected for such studies
are not usually randomly selected (they can be to maximize generalizability),
these designs solve the control group problem in interesting ways.

In a time-series design, the class is its own control group. The time series means
that several pretests and posttests are administered. These don't have to be
"tests," of course; they might be your observation records, student performance
in the language lab, or answers to questionnaires. The schematic representation
for the design might be:

X1 - X2 - X3 - T - X4 - X5 - X6

By collecting data prior to the treatment, you can establish the normal growth in
performance over a period of time (say, three weeks), then institute the treatment,
and follow this with observations following the treatment (weeks 4, 5, and 6).
There is no special number of observations required before giving the treatment.
Once you feel you've established the growth curve, you can begin.

Let's use an example to see how this works. Imagine that your students are
primarily from China and Southeast Asia. They use many topic/comment
structures in their compositions as well as in their oral language. (Topic/comment
structures are sentences such as How to write acceptable English sentence, that is
my first problem in picking up my pencil.) During the first three weeks of class you
note each student's use of topic/comment structures in written production. They
produce large numbers of topic/comment structures. When you feel that you
know the usage pattern, you institute a program to show and give practice in the
ways that English manages to preserve word order and still fulfill the pragmatic
need to highlight topic vs. comment. You continue, then, to look at the students'
use of topic/comment structures to see whether they begin to diminish and
whether the change is sustained without further instruction.

If you found results such as those shown below, you would be led to conclude 
that your instruction had no effect. 

[Graph: Freq. by Time 1 2 3 4 5 6]


If you found that a rapid drop took place following instruction, you would
assume it did have an effect. Students made fewer topic/comment errors in their
writing.


[Graph: Freq. by Time 1 2 3 4 5 6]


If you obtained a line similar to the following, you might believe that the
instruction (1) was detrimental, or (2) totally confused some students but may have
helped a few, or (3) that following a period of confusion the learning curve would
be as dramatically reversed.


[Graph: Freq. by Time 1 2 3 4 5 6]
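
The three outcomes described above can also be summarized numerically. The sketch below is only an illustration: the weekly counts are invented, the helper name is hypothetical, and it assumes three observations before the treatment and three after, as in the schematic. It compares the mean frequency before instruction with the mean frequency after it.

```python
from statistics import mean


def summarize(freqs, n_pre=3):
    """Return (pre-treatment mean, post-treatment mean, change)."""
    pre, post = freqs[:n_pre], freqs[n_pre:]
    return mean(pre), mean(post), mean(post) - mean(pre)


# Invented weekly counts of topic/comment structures for weeks 1-6,
# with the instruction given after week 3.
no_effect = [20, 19, 20, 20, 19, 20]
rapid_drop = [20, 19, 20, 8, 7, 7]
rise = [20, 19, 20, 26, 27, 25]

for label, series in [("no effect", no_effect),
                      ("rapid drop", rapid_drop),
                      ("rise", rise)]:
    pre_mean, post_mean, change = summarize(series)
    print(f"{label}: pre={pre_mean:.1f} post={post_mean:.1f} change={change:+.1f}")
```

A change near zero matches the "no effect" picture, a large negative change matches the rapid drop, and a positive change matches the third case. A summary of this kind does not replace looking at the whole line, since it cannot show whether a change is sustained from week to week.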




There are many possible variations in time-series designs. For example, if you 
wanted to know which of two techniques worked best, you could vary the design: 

X1 T1 X2 - X3 T2 X4 - X5 T1 X6 - X7 T2 X8 etc.

With this design you would alternate periods when you used one set of
instructional materials (T1) with times when you used the second set (T2). Or, if you
have two groups, you could alternate treatments so that, at time 1, group A
received special materials and group B did not, and then reverse this at time 2 as
in the following diagram.

               Time 1     Time 2     Time 3     Time 4

Materials t    Group A    Group B    Group A    Group B

Materials c    Group B    Group A    Group B    Group A

If a time-series design does not seem possible for the particular research question
you want to investigate, you could still find ways of letting your class serve as its
own control. You could divide the class in two groups (using random assignment
of students to the groups) and then randomly assign control and experimental
status to the groups. This means that you'd have to prepare separate lesson plans
for the groups and administer them in a way that neither group was aware of the
treatment of the other. The problem, here, is that you are likely to end up with
very few students in each group and, as you will learn later, the fewer the
number of subjects in a group, the more difficult it will become to discover whether
differences between the groups are "real."

Time-series designs are much more useful than a pretest given at the beginning
of the year and a posttest at the end of the year for many applied linguistics
projects. They are ideal for evaluation in materials development projects.

Some school districts and departments of education have professionals who
engage in program evaluation. Evaluators of bilingual education programs often
complain that one-year pretest to posttest evaluations on mandated tests cannot
show the true gains made in such programs. They suggest that research be
longitudinal, following children for a minimum of two years. A time-series design
would be ideal for parts of such evaluations.

You will remember (that means you might not) that in chapter 1 we considered
the possibility of making the study of Luisa more feasible by combining a
longitudinal (time-series) design and cross-sectional design. The same sort of
combination could be carried out for the evaluation of instruction. You might want to
carry out a time-series design within your class at a series of points. In addition,
you might conduct a cross-sectional study with other classes of students whom
you expect to be at the same point of development.


Class A    X - T - X - T - X
Class B    X
Class C    X
Class D    X


There are definite advantages to the use of time-series designs. We know that
even though we offer students instruction and they are able to perform well
during the instruction period, they may not truly internalize the material unless it is
recycled over a fairly long period of time. Since this is the case, longitudinal
time-series studies are frequently used to discover how long it takes students to
reach the goal and the amount of variability of performance along the way.
There are disadvantages too. For example, if you conduct your research in
schools and the families in your district are very mobile, the attrition and the new
students constantly arriving will make it difficult to use such a design.


ooooooooooooooooooooooooooooooooooooo 

Practice 3.5 

1. Review the diagrams for the Chinese topic comment study on page 91. How 
would you interpret the following two diagrams? 



[Graph: Freq. by Time 1 2 3 4 5 6]


Explanation: 


[Graph: Freq. by Time 1 2 3 4 5 6]


Explanation: 

2. In working with these Chinese Ss, how might you incorporate control groups
into the design to evaluate the effectiveness of an instructional program that
addresses topic/comment in English?


Would the design give you more confidence in your findings? Why (not)? 


3. In your study group, consider how you might design a time-series project to 
evaluate a set of course materials. Assume that the new course materials do meet 
the same objectives as the previous course materials. (You might want to tackle 
the problem of how you might handle differences in the amount of time required 
in new materials versus old materials.) 

Group report:____ 


4. Summarize the advantages and disadvantages you see for time-series designs. 


5. Review the research questions you posed for your own research in chapter 2. 

Which could be carried out using a time-series design?_ 

Give the design scheme below. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Experimental Studies 

We have already mentioned that true experimental studies are relatively rare in 
applied linguistics. Yet, we strive to approach the "ideal" form of such studies. 

True experimental studies do use control groups. They also assess and/or control
for differences between groups prior to the start of the experiment. Most
important, they require random selection of Ss and random assignment of Ss to control
and experimental groups. Finally, the assignment of control and experimental
status is also done randomly. This means that all Ss of an identified group have
an equal chance of being selected to participate in the experiment. Once selected,
the Ss have an equal chance of being assigned to one group or another. The
groups have an equal chance of being assigned to control and experimental
status.
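
The three random steps just described (selection, assignment to groups, and assignment of status) can be sketched in a few lines. This is an illustration under assumptions: the pool of 40 Ss, the sample of 20, and the fixed seed are invented, and the seed is there only so that the sketch gives the same result each time it is run.

```python
import random

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Invented pool: every S in the identified group.
pool = [f"S{i:02d}" for i in range(1, 41)]

# Random selection: each S has an equal chance of being chosen.
selected = rng.sample(pool, 20)

# Random assignment: shuffle the selected Ss, then split them in half.
rng.shuffle(selected)
group_1, group_2 = selected[:10], selected[10:]

# Random assignment of status: a coin flip decides which group is experimental.
if rng.random() < 0.5:
    experimental, control = group_1, group_2
else:
    experimental, control = group_2, group_1

print("experimental:", sorted(experimental))
print("control:     ", sorted(control))
```

In practice a table of random numbers or a coin does the same job; the point is only that no step depends on the researcher's judgment or on the Ss' own choices.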


Random Assignment Posttest 

The schematic for this design is: 

G1 (random) - T - X
G2 (random) - O - X


Suppose a school has a very large enrollment of newly arrived students from
other countries. In the past, these students were tested and placed at the
appropriate level for ESL. However, the ESL program teachers believe that an
orientation to school life and school responsibilities should also be available to these
students. A program has been designed but, before the school commits money
to the program and personnel, it would like to know that it promises to be
effective.

All entering nonnative speakers designated as ESL students form the pool from
which Ss are randomly selected for assignment to condition A or condition B.
Obviously, the school does not want to do a straight random assignment for this
study. It should be a stratified random selection. Following the stratified
random selection and assignment to the groups, with a flip of the coin, group 1
is selected as the control group and group 2 is selected as the experimental
group. The control group receives the regular ESL training. The experimental
group receives ESL but, for some part of the ESL program time, students also
receive an orientation. At the end of the program, students are given a variety
of measures to evaluate the effectiveness of the programs. The researchers
expect to generalize the findings to all entering nonnative speakers (so long as
they match the original group from which the samples were drawn).
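
The stratified selection, assignment, and coin flip just described can be
sketched in a few lines of Python. This is only an illustration under stated
assumptions: the pool of 60 students, the use of ESL placement level as the
stratifying variable, and the function name stratified_assign are all
hypothetical, not part of the study described above.

```python
import random
from collections import defaultdict

def stratified_assign(students, stratum_of, per_stratum, seed=0):
    """Select per_stratum Ss from each stratum, split them evenly into two
    groups, then flip a coin to decide which group is the control."""
    rng = random.Random(seed)  # fixed seed so the draw can be reported
    strata = defaultdict(list)
    for s in students:
        strata[stratum_of(s)].append(s)

    group1, group2 = [], []
    for members in strata.values():
        chosen = rng.sample(members, per_stratum)  # random selection in stratum
        half = per_stratum // 2                    # sample order is random,
        group1.extend(chosen[:half])               # so this split is a random
        group2.extend(chosen[half:])               # assignment to groups

    # Coin flip: which group receives control status
    if rng.random() < 0.5:
        return {"control": group1, "experimental": group2}
    return {"control": group2, "experimental": group1}

# Hypothetical pool: 20 students at each of three placement levels
pool = [(i, level)
        for level in ("beginning", "intermediate", "advanced")
        for i in range(20)]

groups = stratified_assign(pool, stratum_of=lambda s: s[1], per_stratum=10)
print(len(groups["control"]), len(groups["experimental"]))
```

Because selection happens inside each stratum, both groups end up with the same
number of Ss from each level, which is the point of stratifying.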


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 3.6 

1. List the ways stratification might be done for the above study.


In this study, were students randomly selected? 
Were they randomly assigned?_ 
Were the experimental and control groups equal to begin with? If yes, how was
this insured? If no, why not?_


2. How carefully should you describe the treatment? How might you go about 
making sure that the treatment really is as described? _ 


What history factors would you check out? 


3. How might you assure that the posttests are kept anonymous (as to whether 
they are from the experimental or control group) during the coding or scoring 
process? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Control Group Pretest-Posttest

The schematic representation for this experimental design is: 

G1 (random) - X1 - T - X2
G2 (random) - X1 - O - X2

With the addition of a pretest, the design can be improved. (In the previous
design you may have controlled for equivalence of the groups by specifying the
skill level as one way of selecting a stratified random sample. Random selection
also helps ensure equivalence of groups since every S has an equal opportunity
of being selected and assigned to experimental and control groups.) With a
pretest, you have a number of options. You can actually match individual
students in the experimental and control groups on the basis of their pretest
scores, and compare the performance of these matched groups. Another possibility
is to subtract each S's pretest score from the posttest score and compare the
gains (rather than the final test scores) of the Ss in each group. Finally,
there are statistical procedures (e.g., ANCOVA) that will let you control for
the student's pretest ability in analyzing the final test scores. All of these
options increase the internal and external validity of the study. That is, you
can feel more confident that any claims you make about differences in the two
groups after the treatment are not due to preexisting differences in the groups.
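
Two of these options can be illustrated with a small Python sketch. The scores
below are invented for the illustration, and the function names are our own.
The first function computes gain scores. The second computes posttest means
adjusted for the pretest using the pooled within-group slope, which is the
descriptive core of ANCOVA; it does not carry out the significance test.

```python
from statistics import mean

def gain_scores(pre, post):
    """Posttest minus pretest for each S."""
    return [b - a for a, b in zip(pre, post)]

def pooled_within_slope(groups):
    """Slope of posttest on pretest, pooled across groups."""
    sxy = sxx = 0.0
    for pre, post in groups:
        mx, my = mean(pre), mean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
    return sxy / sxx

def adjusted_means(groups):
    """Posttest means adjusted to the grand pretest mean."""
    b = pooled_within_slope(groups)
    grand_pre = mean([x for pre, _ in groups for x in pre])
    return [mean(post) - b * (mean(pre) - grand_pre) for pre, post in groups]

# Hypothetical scores: (pretest, posttest) for each group
experimental = ([10, 12, 14, 16], [15, 17, 19, 21])
control = ([12, 14, 16, 18], [14, 16, 18, 20])

print(mean(gain_scores(*experimental)), mean(gain_scores(*control)))
print(adjusted_means([experimental, control]))
```

In these made-up data the control group starts higher on the pretest, so the
raw posttest means understate the difference; both the gain scores and the
adjusted means correct for that starting difference.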

Let's illustrate this with another school project. Since U.S. schools across
districts, and sometimes statewide, have certain units that are covered at each
grade level, and because state ESL supervisors have decided to cover the ESL
objectives using the content material for that grade level, new materials for
ESL-math and ESL-social studies have been designed. One such ESL-social studies
unit for the fourth grade covers the California Missions. Before the school year
began, all fourth-grade ESL classes (classes now act as though they were
subjects in this study) in the school district were given an identifying number.
Sixty numbers were then drawn at random to participate in the study. Thirty of
these numbers were drawn and assigned to group A and 30 to group B. With a toss
of a coin, group A was designated the experimental group and group B, the
control. Schools were notified and teachers in both groups were given an
in-service workshop on the California Missions. The teachers in the experimental
group, however, were also shown how to use the unit to meet ESL objectives for
the fourth grade. Pretests were sent to all participating schools and
administered by school personnel other than the participating teachers. The
course materials were taught and a posttest was given to control and
experimental classes.
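
The drawing of class numbers in this example can be mimicked as follows. This
is a sketch only: the district size of 200 classes is an assumption (the text
does not say how many fourth-grade ESL classes there were), and the function
name select_and_assign is hypothetical.

```python
import random

def select_and_assign(class_ids, n_selected=60, seed=1):
    """Randomly select classes, split them into two equal groups, and
    toss a coin to decide which group is experimental."""
    rng = random.Random(seed)
    selected = rng.sample(class_ids, n_selected)   # random selection
    half = n_selected // 2
    group_a = selected[:half]                      # sample order is random,
    group_b = selected[half:]                      # so the split is random
    if rng.random() < 0.5:                         # coin toss for status
        return {"experimental": group_a, "control": group_b}
    return {"experimental": group_b, "control": group_a}

# Hypothetical district with 200 numbered fourth-grade ESL classes
district = list(range(1, 201))
design = select_and_assign(district)
print(len(design["experimental"]), len(design["control"]))
```

Note that the unit being drawn is the class, not the individual student, which
matches the statement that classes act as subjects in this study.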


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 3.7

1. Have the Ss (classes) in the Missions study been randomly selected? Have 
they been randomly assigned? Has status (treatment or control) been randomly 
assigned? If not, why not?_ 


2. How would you feel if you knew that your class had been selected to trial the 
new ESL materials on the California Missions?_ 


Do you think it is important to notify teachers that their class will serve as a
control group in a study? If not, why not? If so, how much information do you
think they should be given?

3. What possible threats do you see to external or internal validity for the study?


4. Review the research topics you have posed for your own research. Which
could be carried out using a true experimental design? Which design would you 
prefer? Why? How feasible is your choice? 


5. In some cases, you might not be sure whether it would be possible to use
random selection, control for preexisting differences, or exactly how many control
groups would be needed. Discuss these in your study group. Note the
suggestions in the space provided.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Ex Post Facto Designs 

Since it is not often the case that we can meet all threats to internal and
external validity, and can arrange for random selection and random assignment,
we are not able to make causal claims, that is, claims of cause and effect of
independent variable(s) -> dependent variable(s). While we believe one should
not make causal claims without satisfying the requirements of experimental
design, we can still discuss the type and strength of the relationship of the
independent and dependent variable in your particular study. When we cannot
reach the ideal of true experimental design (random selection and assignment,
control of preexisting differences among groups, and the use of control groups),
what's a body to do?

One possibility is the use of ex post facto designs. In such designs you will look 
at the type of connection between independent and dependent variables or the 
strength of the connection without considering what went before. No treatment 
is involved. Good design requires, however, that you consider all the possible 
threats to the validity of the study and try to control for as many of them as 
possible. 

As an example of an ex post facto design consider the following.

Your curriculum is set up for a total skills approach. You notice
that there seems to be a great range within your class and also
across all the course levels in terms of how quickly students are able
to complete tasks. You wonder if this is due to slow reading speed.
The first step, then, is to discover the average reading speed of
students across the levels. Are students in the beginning sections
reading more slowly than those in intermediate sections, and are
students in the advanced sections reading most rapidly? Or, is it the
case that there are such wide ranges in reading speed at each of
these class levels that course levels are not really related to reading
speed?


The dependent variable is reading speed. You want to discover whether class 
level (the independent variable) can account for differences in reading speed (the 
dependent variable). There are three class levels: beginning, intermediate, and 
advanced (the class level defined more precisely in the study). Class level is a
nominal variable. Reading speed is an interval scale variable.


The design box is: 


                 +----------+----------+----------+
Reading Speed    |          |          |          |
                 +----------+----------+----------+
                     Beg       Interm      Adv


In this project, is the research trying to show that performance has been improved
on the basis of instruction? No, it is not. Are causal relations to be established?
No. Are any of the variables being manipulated to cause a change? Again, no.
So, the project is an ex post facto design.
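
To fill in the cells of the design box above, you would summarize reading speed
at each class level. A minimal Python sketch follows. The words-per-minute
figures are invented for illustration, and describe_by_level is a hypothetical
helper name. The sketch reports both the mean and the range for each level,
since the question is whether the ranges within levels are so wide that level
tells us little about reading speed.

```python
from statistics import mean

def describe_by_level(records):
    """Group (level, reading_speed) records and summarize each level."""
    by_level = {}
    for level, wpm in records:
        by_level.setdefault(level, []).append(wpm)
    return {
        level: {"n": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
        for level, v in by_level.items()
    }

# Hypothetical reading speeds in words per minute
records = [
    ("Beg", 90), ("Beg", 110), ("Beg", 160),
    ("Interm", 120), ("Interm", 150), ("Interm", 180),
    ("Adv", 140), ("Adv", 200), ("Adv", 230),
]

summary = describe_by_level(records)
for level, stats in summary.items():
    print(level, stats)
```

In these made-up data the means rise with level, but the ranges overlap
considerably, which is exactly the situation the example asks you to look for.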


This is the most common design type in applied linguistics for it allows us to
discover "what is going on" rather than "what caused this." We may videotape our
language class and then ask which students take the most turns--do boys or girls
volunteer more often in the classroom? This will tell us "what is going on" and
could lead to the design of a program to encourage more equal participation
(though that is not part of the research). We might look at the amount of teacher
talk vs. student talk. Again, this will tell us "what is going on" and could lead to
the design of a teacher-training module on ways of encouraging more student
participation. We might describe the types of feedback the teacher gives students
or how students react to feedback from fellow students and from the teacher.
The videotape data could be used to answer many "what is going on" kinds of
questions. The design, in each case, is post hoc; it lets us describe some data and
see how the values vary across groups of subjects, across tasks, and so forth.


We use post hoc designs for text analysis too. The research tells us what is going
on, not the effect of some treatment. We can examine text features (e.g., use of
modals) and see how they vary across genres (e.g., narrative vs. procedural text).
The analysis will describe what is already there, not a change brought about by
some instructional treatment.

Post hoc designs are common in reporting test results. We might want to
compare the results of students from different first language groups, or those of
immigrant vs. foreign students, or those of students who have gone through a series
of courses in our program vs. those admitted at some higher level. Again, the
research tells us about the relationship of variables in the data, not about the
effectiveness of some instructional program or treatment. Ex post facto designs
which incorporate random selection, of course, would allow for generalization of
results to some degree. However, one reason these post hoc designs are so
common is that many of our topics have nothing to do with a treatment. Rather, we
want to discover the effect of some independent variable on a dependent variable
(e.g., whether first language, which is not a treatment, influences the types of
transfer errors of ESL students). Another reason, though, is that finding out
"what is going on" is a first step in planning for instructional innovation. Once
we have confidence in our descriptions of what is happening, we can plan for
change. The evaluation of the treatment for bringing about change will, of
course, require a different design.
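
The text-analysis case mentioned above (use of modals across genres) can be
sketched as a simple count. The two sample texts and the modal list below are
assumptions made for illustration. The count is deliberately crude: it matches
word forms only, so a noun such as "will" or "can" would be counted as a modal,
and a real study would need to check the hits by hand.

```python
import re

# A working list of modal forms (an assumption; adjust for your study)
MODALS = {"can", "could", "may", "might", "must",
          "shall", "should", "will", "would"}

def modal_rate(text):
    """Modal forms per 1,000 words of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in MODALS)
    return 1000 * hits / len(tokens)

# Hypothetical text samples, one per genre
samples = {
    "narrative": "She walked to the river and sat down. Nobody came.",
    "procedural": "You should stir the mix. You must not stop.",
}

for genre, text in samples.items():
    print(genre, round(modal_rate(text), 1))
```

The result describes what is already in the texts; no treatment is involved,
so the design remains post hoc.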


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 3.8 

1. A fellow teacher-researcher uses the Project Approach and wants to
accumulate evidence that the approach "works." From a list of possible projects,
Ss in this advanced ESL composition class decided to write an information
brochure for deaf students who will spend six weeks on campus during the summer.
The project requires that Ss interview leaders of community services, business,
and government agencies regarding deaf awareness in the community, and
recreation, transportation, cultural activities, work, worship, and business
opportunities for the visitors. It also means compiling information on housing,
meals, library, tutorial service, recreation, and so forth on campus. To keep
track of plans and progress, the teacher and Ss will use computer electronic
mail on linked terminals. The teacher envisions grouping Ss to be responsible
for different sections of the brochure, peer critiques across groups, teacher
critiques, assistance from the printing office on layout and formats, and final
critique of the brochure from the Summer Sessions Office. In addition to
interviews, Ss will give oral progress reports to the coordinator of the
program, to Summer Sessions personnel, and to community leaders. They will give
written reports to the campus and local newspapers. They might be interviewed or
asked to write a paper on the composing process and how the brochure might be
improved. The teacher is open to any suggestions concerning design and data
collection.

In your study group, assign each member a design type. Use your assigned
design type to design the study (or some portion of the study). List the
strengths and weaknesses of the design type for this particular study. In the
group, compare the results and determine which design, among those that are
feasible, is best suited for this study. Report the results and justification
for the choice below.

2. Repeat this process for your own research topic. Design the study using each
of the design types. Then select the best design type and justify your selection to
your study group. Report the results below.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


In this chapter, we have considered the construction of research designs. Such a
consideration is important for several reasons. First, it will help you determine
the best design for your particular study, including ways of avoiding possible
threats to internal and external validity. Second, it will clarify the research
so that you can more easily determine which statistical procedures will be most
appropriate for the analysis of the data you collect. Third, it is useful to
think about possible threats to the validity and reliability of research in
balance with the complexity of the final design. The simpler the design, the
easier it will be to select an appropriate procedure and interpret results. Now,
in the next chapter, let's turn to writing a project proposal using the
information we've acquired so far.


Activities 

1. L. Seright (1985. Age and aural comprehension achievement in Francophone
adults learning English. TESOL Quarterly, 19, 3, 455-473.) The Ss in this study
were military personnel enrolled in an intensive ESL course in Canada. Ss were
classified as older (25-41) or younger (17-24) learners. Eighteen pairs of
older/younger Ss were obtained by matching for (a) informal exposure to English,
(b) the pretest, (c) nonverbal IQ, (d) education, and (e) previous ESL instruction.
A pretest on aural comprehension was given prior to a 12-week course of
instruction. Then a posttest was given. On the basis of this brief description,
state the general research question. Which of the ways of matching Ss do you
think are especially important to the research? Why? What other ways of matching
do you think you might have used had this been your study? Give the schematic
for the design.

2. J. T. Crow & J. R. Quigley (1985. A semantic field approach to passive
vocabulary acquisition for reading comprehension. TESOL Quarterly, 19, 3,
497-514.) compared the effectiveness of an ordinary method of teaching
vocabulary (with lists, derivations, matching exercises, multiple-choice, word
substitutions in paragraphs, word puzzles) with a semantic-field approach. In
this approach, a vocabulary word in context is identified with a key word and
four other related words. For example, rage in a text would be given the key
word anger and four related words (fury, ire, wrath, indignation). Students had
practice with activities such as picking out an unrelated word in a list of
possible related words, substituting key words for target words, and so forth.
Half of the words presented in the semantic-field condition were randomly
selected and presented in the traditional method. In the study 42 Ss were
assigned to four sections of a class. Two of these sections became group 1, and
two became group 2. A pretest was given to all Ss at the beginning of the
course. Group 2 classes were assigned to the experimental condition and group 1
classes to the traditional method. After instruction, the first posttest was
given. Next the treatments were switched so that group 2 classes became the
traditional group and group 1 classes the experimental group. Following
instruction a second posttest was given. Two additional follow-up tests were
given without intervening instruction using either method. From this brief
description, state the general research question. Does this study qualify as a
true experimental design? If not, how close does it approach the "ideal"? Draw a
schematic representation of the design.

3. T. Robb, S. Ross, & I. Shortreed (1986. Salience of feedback on error and its
effect on EFL writing quality. TESOL Quarterly, 20, 1, 83-93.) One hundred
thirty-four Japanese college freshmen were assigned alphabetically to four
sections of English composition. A cloze test on the first day of class showed no
differences among the four sections, but an initial in-class composition did show
differences which were statistically controlled for in the final analysis. All
sections received exactly the same in-class instruction. All Ss were required to
revise their weekly homework compositions on the basis of feedback they received
from their instructors. There were four feedback techniques: (1) complete
correction, (2) coded feedback (with a code sheet guide), (3) yellow highlighting
of errors, and (4) listing the number of errors per line (Ss had to find the
errors). The compositions were graded on 19 different features which were then
collapsed into 7 categories. Since no meaningful differences were found for the
feedback treatments, the authors speculate that perhaps teachers spend too much
time worrying about composition correction. From this brief description, sketch
a schematic representation of the design. How would you classify the design type?
Does the study have random assignment of Ss? If this were your study, what
other feedback techniques might you test?

4. B. Tomlinson (1986. Characters are coauthors. Written Communication, 3,
4, 421-448.) collected many examples of a prominent metaphorical story that
novelists use to discuss the roles played by their characters in their composing
processes. Authors talk about their characters coming alive, doing things the
author didn't expect, suggesting what they want to say or do, wanting to take
over the writing process, making evaluative comments, and so forth. The author
gives many examples drawn from a review of over 2,000 published literary
interviews and suggests that these are interesting metaphorical data that show
how novelists represent their composing processes. If you had similar data and
wanted to share the results with others, but your field did not accept examples
alone as evidence for speculation, what might you do? Can you think of any way
that you could redesign this study for another audience?

5. L. B. Jones & L. K. Jones (1985. Discourse functions of five English sentence
types. Word, 36, 1, 1-22.) discuss the function of clefts, pseudo-clefts and
rhetorical questions. They claim (on the basis of examples from written text)
that all three serve the same basic function of highlighting the theme. Here are
some examples:

Pseudo-clefts 
What X liked 
What X finally meant 
What I am going to discuss 
What I am going to do here

What was especially shocking 
What I like is 
What is needed is 
What can/should be done is 

Clefts 

It is evidently a universal that    (show theme + discount
It is sheer conceit that            value judgment)
It is in the course of X that

It was X who                        (show theme and assigned focus
It was a coincidence that           to crucial item related to theme)
It was not X, but Y that

Rhetorical Questions

How important is X?                 (show theme for a few sentences)
The question is why/what/etc.

Assume that you wanted to validate this information. You have the Brown
corpus so you can make a computer search. What is your research question? What
is the research design?
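
A computer search of this kind might begin with rough pattern matching on
sentence openings. The sketch below is an illustration only: the patterns are
our own crude approximations of the frames listed above, the four test sentences
are invented, and nothing here loads the Brown corpus itself. Patterns like
these over-match (for instance, an extraposed sentence such as "It is clear
that..." also fits the cleft pattern), so every hit would need to be checked by
hand before any function is assigned to it.

```python
import re
from collections import Counter

# Rough surface patterns, checked in this order (assumptions, not a grammar)
PATTERNS = {
    "rhetorical question": re.compile(r"^(How|Why|What)\b.*\?$", re.IGNORECASE),
    "pseudo-cleft": re.compile(r"^What\b.*\b(is|was)\b", re.IGNORECASE),
    "cleft": re.compile(r"^It (is|was)\b.*\b(that|who)\b", re.IGNORECASE),
}

def classify(sentence):
    """Return the first pattern name the sentence matches, or None."""
    s = sentence.strip()
    for name, pattern in PATTERNS.items():
        if pattern.search(s):
            return name
    return None

# Hypothetical sentences standing in for corpus data
sentences = [
    "What is needed is a new approach.",
    "It was Mary who found the error.",
    "How important is this?",
    "The cat sat on the mat.",
]

counts = Counter(classify(s) for s in sentences)
print(counts)
```

Once candidate sentences are located, the research question becomes where they
occur in the text and what follows them, which is a matter of design rather
than of searching.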

In addition to making claims about the function of clefts, pseudo-clefts, and
rhetorical questions, the authors also discuss extraposed sentences and sentential
adverbs. The authors believe the function of these two structures is to mark
author comments. They give examples to show that these structures appear in
remarks, asides, and footnotes. Examples of these two forms are:

Extraposition 

It should have been obvious that...but 

It is clear that X is 

It is unclear whether 

It is conceivable that 

It is astonishing to find that 

Sentential Adverbs 
Ironically, X is 
Interestingly, X is 
Incidentally, X is 


(show theme) 


was 

(show theme + emotive state
or right/wrong judgment) 
How might you test this claim regarding the function and location of
extraposition and sentential adverbs? Draw a design box for the study. Which
of these two functions (author comments or theme identification) would you
rather investigate? If you found the claims to be correct, which would be most
useful in your language teaching?

Chapter 4

Writing the Research Proposal 
and Report 


• Variation in formats 

• Organizing the proposal/report 

Introduction: related research, research question(s)
Method: subjects, procedures, analysis
Results and discussion

Front and back matter: title page, abstract, references, appendices


Variation in Formats

Writing a plan for the research is the first chance we have to put all the pieces
together. It can serve as a rough draft of the final research report. Therefore,
you can save time and effort if you use the same format for the proposal as you
will ultimately use in the final report.

The research proposal will answer the questions that any critic might ask when
you first say that you have a question or questions that you want to address:

What is/are the research question(s)? 

What has already been done to answer the question(s)? 

What evidence do you expect to gather to answer the question(s)?

What are the variables in the study and how are they defined? 

Where or from what Ss (or texts or objects) do you expect to gather the 
evidence? 

How do you expect to collect the data? 

How will you analyze the data you collect? 

What do you expect the results to be? 

Exactly how will the results you obtain address the question? 

What wider relevance (or restricted relevance) does the study have? 

Are there any suggestions for further research? 

Where can the related literature be found? 

If new materials, tests, or instruments are proposed, what does a sample
look like? 

If funding is sought, what are your qualifications (your curriculum vitae)? 
Timetable for the research 
Budget for the research 

This list looks formidable. Whether you will need to answer every question
depends on the purpose for which you prepare the proposal. However, even if you
don't write out an answer to each question, you should consider all questions
carefully. For example, you may not apply for funding and the budget may not
need to be specified. That does not mean you shouldn't think about it in
preparing your plan.

If you prepare the proposal for anyone other than yourself, the first thing to do 
is to inquire whether there is a set format that you should follow. Grant agencies 
use forms that vary in detail. For example, a Guggenheim grant asks for no more 
than three double-spaced pages outlining the research plan. NSF (National
Science Foundation) grants typically ask for a complete and detailed proposal. You
can check with your university or library to find out where information on grants 
is available. Special grants (e.g., Woodrow Wilson grants, Fulbright research 
grants) may be available for graduate students. 

Your department, if it is like ours, may give graduate students a detailed outline 
for thesis and dissertation proposals. You may be required to give not only an 
extensive description of the study but a list of relevant course work and a detailed 
timetable for completion of the project. Whether the format is open or extremely 
detailed, the same questions will need to be answered. 

If your proposal is to be submitted in one of the more common formats such as
the APA (American Psychological Association) format, the MLA (Modern
Language Association) format, or the Chicago style sheet, computers can be very
helpful. Nota Bene, and other personal computer software programs, offer you
a choice of many of these formats. In addition, most universities have mainframe
programs where a special software program formats the text in exactly the style
required by the university. Once you become acquainted with such programs,
they are invaluable: they make research more feasible by saving you time and
money.


ooooooooooooooooooooooooooooooooooooo 

Practice 4.1 

1. Obtain instructions for writing a proposal from your department, university, 
or grant agency. Look at the instructions. How similar is the format to that
given in this chapter? In what order do questions (those that must be answered 
in all research projects) need to be presented? 


2. Check the range of computer software (and hardware) available for report
writing and for data analysis. In your study group, evaluate the software
available for your use. Which programs are best for your word processing needs?
Do these programs also offer possibilities for data analysis?


ooooooooooooooooooooooooooooooooooooo 

Organizing the Proposal/Report 


Introduction 

Typically, proposals and research reports begin with a short introduction. In 
most research formats, the introduction is not labeled as such. Rather, the writer 
begins with a brief statement that shows the question is timely and of importance. 
Because novice researchers sometimes find it difficult to break the "writer's
block," it is worthwhile to pay attention to the almost formulaic phrasing of these
opening sentences. You may not want to use them yourself, but they are always 
available to get you started. 

Recent research has suggested xxx. Such (research, work, an interpretation)
has many strengths but...

The issue of xxx is one which has puzzled researchers for many years. Recent
work, however, has offered interesting ways in which we might approach
this issue.

A number of proposals have been made in regard to xxx.

Contemporary research into the nature of xxx emphasizes the role of xxx
factors. A large part of this research has been based on the idea... However,
in spite of the diversity/agreement of... exactly how we might xxx is not well
defined.

The notion of xxx commands a central position in (theory, research,
instruction). While..., there is considerable controversy regarding...

One of the most hotly debated topics in xxx has been the significance of...

Researchers and teachers have been interested in X and Y throughout the
twentieth century, but only recently have...

Although xxx has always been recognised, scientific interest in this topic has
developed slowly/rapidly over the past x years.

xxx and yyy are, for the (applied linguist, educational psychologist,
language teacher), two of the most interesting facets of...


After the opening lines, the researcher provides a brief review of the most relevant
research reports on the question. At the end of the review, it should not only be
clear why the question is "interesting," but also why previous research has failed
to answer the broad research question (e.g., not enough research to answer the
broad research question; the research was somehow flawed; the research involved
Ss that differ from those you wish to select; the procedures you wish to use will
overcome previous threats to validity; and so forth).


Near the end of the introduction, the research questions are stated. Interestingly,
some review committees (whether grant committees or department committees)
as well as some journals prefer that the research question not be stated in the null
form. The reason is, we believe, stylistic. Be sure to check with your advisor (or
look at past issues of the journal) before deciding on the precise form of the
research question. Operational definitions of terms may also be given here. And,
at this point, the researcher may also place disclaimers to show limitations on the
scope of the research.


If you are preparing a proposal for a thesis or dissertation, it is likely that the
introductory section will be extensive. Typically, committees expect that you will
review all the background studies in the research area as evidence that you are
thoroughly prepared to undertake the study. This review may form a chapter of
your final document. In rewriting the document as a journal article, only the
research most directly relevant to the study would be used. Thesis proposals often
give much more documentation to the importance of the study, the original
contribution the study will make to the field, and its relevance to theory or practice.
Again, this is evidence that the author is qualified to undertake the research. In
preparing a journal article on this research, the author, with a few brief sentences,
can make the connection between research and theory for readers.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 4.2 


1. Select one of the research articles you reviewed for the exercises in the previous
chapters. Outline the introduction section for this study. Where are the research
questions posed? How detailed is the introduction? How extensive is the
literature review? Are formal hypotheses stated for the questions? Is there a
"limitations to this study" section in the introduction? Are operational
definitions given for the key terms in the research?

2. Compare the results in your study group across all the articles surveyed. How
much variability is shown across the studies? How might you account for this
variability?


ooooooooooooooooooooooooooooooooooooo 

Method 

The introduction tells us what the study is about. The method section tells us how
the study will be carried out. The method section will, of course, vary according
to the type of study. Typically, though, it begins with a section describing the
data source, the "unit of observation" (the Ss and their characteristics, or the
schools/classes and their characteristics, or the text classifications and
characteristics, or the classes and characteristics of objects from which the data are drawn).


Subjects 

The description of the data source should be as complete and precise as possible. 
In journal articles, we do not expect that every detail will be mentioned—journals 
do not have that kind of space nor do readers want to know every tiny detail. 
We do, however, expect that details will be given for variables that are important 
to the study. The major criteria in evaluating these descriptions are precision and 
replicability. When research is replicated, it is not unusual to consult the original 
author for further details. However, readers of articles also evaluate descriptions 
using these criteria. If the descriptions do not allow replication, then in some way 
they are not precise enough to allow the reader to interpret the results. As an
example, consider each of the following fictitious descriptions. Is sufficient
information given so that replication and/or interpretation of results would be
possible?

Subjects 

Thirty native speakers and 30 nonnative speakers of Spanish will serve as
subjects for this study. The nonnative speakers are enrolled in an advanced
Spanish conversation class at Houston Community Adult School. The nonnatives
have English as a first language. Three of these have studied other
foreign languages in addition to Spanish.

Comments on description of data source:

This description of Ss is complete in terms of information on number of subjects
and their first language. While we know they are enrolled in an "advanced"
Spanish class, we do not know much about how fluent they are in Spanish. We
need an operational definition of "advanced" from the author. Other demographic
data (e.g., age, sex, travel to Spanish-speaking countries) might be
needed, depending on the research question.

Texts 

The texts for this study are 30 200-word samples randomly selected from
five short stories by American authors. The short stories appear in ELT,
Vol. 8, a reading textbook for advanced EFL students.

Comments on description of data source: 

We do not know how the random selection was carried out. For a detailed
description of random sampling procedures using a table of random numbers, you
might consult Shavelson (1988, pages 10-11). We do not know how representative
these stories are of American short stories. We do not know if the stories
are original or were adapted (simplified). We do not know whether the sample
size is sufficiently large to contain examples of the variables being studied.
Whether these are important issues could only be determined by reading the
remainder of the article.
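
One way an author could make such a selection replicable is to report the sampling procedure together with the seed used. The following is only a minimal sketch, not the procedure of the fictitious study: a seeded pseudorandom generator stands in for the table of random numbers, and the function name draw_samples, the seed value, and the even split of six samples per story are all assumptions made for illustration.

```python
import random


def draw_samples(stories, samples_per_story=6, sample_len=200, seed=1988):
    """Draw fixed-length word samples at random start points.

    stories: dict mapping a story label to its full text (hypothetical input).
    Returns a list of (label, start_word_index, sample_text) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the same draw can be repeated
    samples = []
    for label in sorted(stories):  # sorted so the result does not depend on dict order
        words = stories[label].split()
        last_start = len(words) - sample_len
        if last_start < 0:
            raise ValueError(f"{label} is shorter than {sample_len} words")
        for _ in range(samples_per_story):
            start = rng.randint(0, last_start)  # inclusive at both ends
            text = " ".join(words[start:start + sample_len])
            samples.append((label, start, text))
    return samples


# Toy data: five made-up "stories" of 1,000 numbered words each.
toy = {f"story{i}": " ".join(f"w{n}" for n in range(1000)) for i in range(1, 6)}
samples = draw_samples(toy)
print(len(samples))  # 5 stories x 6 samples each = 30
```

Note that samples drawn this way may overlap within a story. Whether that matters depends on the research question, and a description of the data source should say how the issue was handled.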


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 4.3 

1. Comment on the following descriptions.

Classes 

The classes selected for this study include five fourth-grade bilingual class¬ 
rooms located in a lower middle-class area where little residential mobility 
has been noted. The control classrooms are five fourth-grade bilingual 
classrooms located in a similar area where high residential mobility is the 
norm. Questionnaire interviews will be conducted with parents to obtain
permission for the children to participate in the study. These interviews, 
conducted in the home language, will solicit further information on parents' 
education, employment, and length of residence as a check on school equiv¬ 
alence for the study. Other questions will tap data relevant to the research 
questions. 

Comments on description of data source:_ 


Speeches 

Transcripts of presidential nomination speeches and nomination acceptance 
speeches given from 1920 to 1988 were obtained from published documents
(New York Times: Vital Speeches of the Day, 197 /. and Los Angeles
Times). These transcripts comprised the data base of the study.

Comments on description of data source: 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


Whether the data sources are people, texts, classes, or whatever, permission must 
be obtained to use the data. The grant agency or your department or institution 
may have set forms to use as models in securing this permission. If for any reason
you find no such formal requirements, you should still get written clearance to
protect yourself as well as your Ss.

No matter how willing your Ss may be to have themselves identified by name in
your research project, always assign pseudonyms, initials, or random identification
numbers.
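
Random identification numbers can be assigned in a few lines. The sketch below is one possible approach under stated assumptions: the function name assign_ids, the "S001" label format, and the names in the example are invented for illustration, and the names are assumed to be unique.

```python
import random


def assign_ids(names, seed=None):
    """Map each real name to a random, unique identification number."""
    rng = random.Random(seed)
    numbers = list(range(1, len(names) + 1))
    rng.shuffle(numbers)  # random order, so an ID says nothing about enrollment order
    return {name: f"S{n:03d}" for name, n in zip(names, numbers)}


key = assign_ids(["Ana", "Ben", "Chen"], seed=7)  # invented names
print(sorted(key.values()))  # ['S001', 'S002', 'S003']
```

The resulting key is the only link between names and data, so it should be stored apart from the data files and never appear in the report.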

Sometimes it is crucial to a study that absolutely natural data be acquired. Many 
researchers feel the data will be compromised if prior permission is sought. Your 
department, grant agency, or institution may not allow you to collect data unless 
you secure permission to use the data before they have been collected. Some in¬ 
stitutions will allow use of data when permission is sought after data collection. 
Be sure to check before collecting data. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 4.4 

1. In your study group, collect examples of clearance forms. How similar are the 
forms? Discuss the ethics of prior vs. delayed permission requests with members 
of your study group. What consensus can you reach?


2. Critique the "Subjects" part of the article you chose to review on page 110. 
How precise is the description? Could you select equivalent Ss (or texts or
objects) based on the description?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


Chapter 4, Writing the Research Proposal and Report 113 






Procedures 


The next part of the method section typically contains a description of procedures.
Some journals and research reports have separate sections to describe
materials and procedures while others combine these into one section,
procedures. If you combine them in one section, it is still customary to have
headings for each within the section.

It is important that the materials (and, where relevant, any special equipment)
and the procedures be described in enough detail that anyone could easily replicate
the study. Consider the following description of materials:

There are two versions of the test, A and B, each with two sections. The
task-based section uses pictures and/or short instructions to elicit talk; the
imitation section includes 10 sentences ranging from 2 to 12 words which
students hear and repeat. Schematically, the forms are as follows:

Form 1: Tasks A, Imitation A
Form 2: Tasks B, Imitation B 
Form 3: Imitation A, Tasks A 
Form 4: Imitation B, Tasks B 

The order of presentation of the sections was varied to control for possible 
ordering effects; the two versions were created so that one could be used as 
a posttest. 

The task section includes a "warm-up" short-answer section plus five major
tasks: description, narrative, process (giving directions), opinion, and
comparison-contrast. Tasks were chosen on the basis of whether or not they
fulfilled certain criteria: (a) their "authenticity" or naturalness, that is, whether
they were perceived as representing the kinds of speaking activities in which
students engage in the university environment, as well as in everyday life;
(b) their match with the rhetorical modes taught in our courses (even
though the emphasis in these courses is on writing in these modes); and (c)
their match with the kinds of speaking activities that take place in our
courses, and thus their ability to serve as a basis for measurement of
achievement. (Copies of the tests appear in appendix A.)

Comments on replicability: 

Materials are replicable because tests are in the appendix. The order of
administration of test sections is replicable. We do not know how the actual
administration was carried out. This might not be important for interpretation but, if
we wished to replicate the study, we would contact the author for precise procedure
directions.
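
The four forms in the materials description above are simply the two section orders crossed with the two test versions, and a replication could document how forms were handed out. The sketch below rebuilds the form list and rotates Ss through it. The rotation rule and the function name assign_forms are assumptions for illustration only; the description does not say how forms were actually assigned.

```python
from itertools import product

VERSIONS = ["A", "B"]
ORDERS = [("Tasks", "Imitation"), ("Imitation", "Tasks")]

# Order varies slowest, version fastest, which reproduces the numbering in the
# description: Forms 1-2 put Tasks first, Forms 3-4 put Imitation first.
FORMS = {
    number: [f"{section} {version}" for section in order]
    for number, (order, version) in enumerate(product(ORDERS, VERSIONS), start=1)
}


def assign_forms(subject_ids):
    """Rotate through the forms so each is used about equally often."""
    numbers = sorted(FORMS)
    return {sid: numbers[i % len(numbers)] for i, sid in enumerate(subject_ids)}


for number, sections in FORMS.items():
    print(f"Form {number}: {', '.join(sections)}")

print(assign_forms(["S001", "S002", "S003", "S004", "S005"]))
```

A study that uses one version as a pretest and the other as a posttest would need a further rule (for example, giving each S the opposite version at posttest), and that rule also belongs in the procedures section.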

In proposal writing, complete copies of materials may be needed. In the final
report, since journals do not always have space for a complete copy of the materials,
you may only be able to give a few examples and either place the materials
in the appendix or note where researchers can obtain a copy of the materials.
Similarly, detailed instructions to Ss may be placed in the procedures section in
a proposal and then, in the final report, simply summarized or placed in an
appendix.

o-ooooooooooooooooooooooooooo^oooooooo 

practice 4.5 

1. Evaluate the following fictitious example: 

Participating Ss assembled in a quiet room and writing materials were given 
them. In session 1, participants in group A read instructions (see appendix
A) to write an essay on the topic "The values of society are reflected in its 
popular music." A time limit of one hour was announced but no further in¬ 
structions were given. Participants in group B read instructions to write an 
essay asking neighborhood residents to support a student project to convert 
a dilapidated house into a center for foreign students, and the time limit of 
one hour was announced. In session 2, participants assembled in the same 
room and were administered a concrete-abstract word-association test and 
the Simms role-taking task. 

||bmmcnts on replicability:_ 


IOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

You may wonder about the emphasis on replicability. It is true that our field, 
perhaps, is unusual in that few replication studies are done. In other fields, the 
reverse is usually true. Once a finding is observed, the same researcher, and
perhaps others, carries out replication studies to see whether the findings will be
upheld. These studies may shift certain variables in the study or try a wider
multimethod approach to the study. It is assumed that if a finding is important
then it is important to replicate the study to be sure of the stability of the finding.
We have emphasized that we carry out research with the expectation of making
an original contribution to the field. Replication studies, however, should be the
meat of much of our work. Perhaps one reason for the present state of affairs is
the mistaken assumption that findings can be generalized, usually an unwarranted
assumption. Another may be the requirement that theses and dissertations
make an "original contribution" to the field. Replication may not always fit
an examiner's idea of originality. Nevertheless, replication studies should be an
important part of our research agenda. 

In cases where special equipment must be used, it may be important to include
a description of the equipment or a note that the description can be obtained 
from the author. 




If the data are to be coded, the procedures for coding are usually given in this 
section. For example:


Classroom activities will be coded using the attached check sheet (see ap¬ 
pendix). Three observers (O1, O2, and O3) will view the videotape records
and at five-second intervals check the appropriate activity (e.g., teacher
structuring move, teacher questioning move, teacher explication) on the 
sheet. Interrater reliability will be established as part of the development 
of the coding instrument. Observers will receive training prior to the study, 
and the training will be reviewed at two-week intervals throughout the study. 


Comments on coding description: 

Coding procedures are clear, given that the check sheet appears in the appendix.
We understand that training and review of the training are important to establish
consistency in ratings but we do not know how interrater reliability will be measured
and what constitutes "established" reliability. These are details that could
be obtained from the author.
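
To show the kind of detail a reader would want, here is a sketch of two common ways of quantifying agreement between a pair of observers: simple percent agreement and Cohen's kappa, which corrects for agreement expected by chance. This is an illustration only. The description above does not say which measure its author intends, the codes S, Q, and E (structuring, questioning, explication) and the ratings are invented, and with three observers one would compute the statistic for each pair or use a multi-rater measure.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Proportion of intervals on which two raters chose the same code."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("both raters must code the same, nonempty set of intervals")
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of the two raters' marginal proportions per code.
    p_e = sum(c1[code] * c2[code] for code in set(c1) | set(c2)) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when both raters use a single code")
    return (p_o - p_e) / (1 - p_e)


# Invented codes for ten five-second intervals.
o1 = ["S", "Q", "Q", "E", "S", "Q", "E", "E", "S", "Q"]
o2 = ["S", "Q", "E", "E", "S", "Q", "E", "Q", "S", "Q"]
print(percent_agreement(o1, o2))        # 0.8
print(round(cohens_kappa(o1, o2), 2))   # 0.7
```

A proposal that names the measure and the level that will count as "established" reliability lets a reader judge the coding without writing to the author.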


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO


Practice 4.6 


1. Comment on the following description. (The task is that undertaken by 
Flashner, 1987.) 




All videotape records have been transcribed and entered in "idea units" form
(see Chafe, W., & Danielewicz, J., 1985) in a computer data file. Each of
the 5,431 units will be individually coded for mode (oral, written), genre
(account, recount, event cast), clause type (the seven linguistic types de¬ 
scribed above), information (given/new), context (contextualized or 
decontextualized), data source (English monolingual, fluent bilingual, lim¬ 
ited English speaker), lesson unit (four different units), and student I.D. 
This coding will be accomplished by creating "tables" using the RBASE 
(Microrim) computer program. Interrater reliability of code assignment 
will be established. 


Comments on coding description: 


ooooooooooooooooooooooooooooooooooooo 


It may seem obvious that the procedures should allow us to answer our research 
questions, but it is amazing how often something is forgotten. In preparing a 
proposal (and, of course, in reading proposals or reports of other researchers), 
look carefully at the procedure statements and check to see that they match the
research questions. If you label each part of the procedure with the number for
the question(s) it answers, you will know for certain that the procedures cover all
of the questions.
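
The labeling check just described is easy to do by hand, but it can also be sketched in a few lines. The question numbers, procedure names, and the function name uncovered_questions below are all invented for illustration.

```python
def uncovered_questions(questions, procedures):
    """Return the research-question numbers that no procedure is labeled with."""
    covered = set()
    for labels in procedures.values():
        covered.update(labels)
    return sorted(set(questions) - covered)


questions = [1, 2, 3]
procedures = {
    "cloze test, week 1": [1],        # answers question 1
    "oral interview, week 2": [1, 2],  # answers questions 1 and 2
}
print(uncovered_questions(questions, procedures))  # [3]
```

Here question 3 has no procedure attached to it, which is exactly the kind of omission the labeling is meant to catch before data collection begins.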

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 4.7 

1. For the article you chose on page 110, critique the materials and procedures
part of the method section. Are the descriptions precise enough to allow for 
replication of the study? 


2. Review the research questions posed and annotate the method section to show 
how each research question will be answered.


ooooooooooooooooooooooooooooooooooooo 

Analysis 

Review boards often comment that the analysis section is the "weak link" of most 
proposals. This should not be the case if you have clearly stated research 
questions, have identified your variables appropriately, and have selected and 
classified the research design. The comparisons that are to be made will be clear;
you know whether previous research suggests a directional hypothesis or not.
You know the function of each variable in the research. You know whether the
data for each variable are nominal, ordinal, or interval (or continuous or noncontinuous).
You know whether comparisons are repeated-measures or for independent
groups. These are precisely the criteria for selecting the appropriate
method of data analysis.

The selection of the appropriate method of data analysis will be presented in the 
following chapters. At this point, however, you have already acquired the rele¬ 
vant information you will need for determining this selection. 

In the final report, this analysis section will be embedded in the results section.
That is, in the proposal, the analysis section lists the comparisons to be made and
the appropriate statistics to do the comparison. In the final report, all the wills
are changed to past tense: that is, "An ANOVA will be used to determine ..."
becomes "An ANOVA was used to determine ..." However, because of the importance
of the results, this section will form a new section of the final report.
Rather than part of "Method," it is now "Results."


Results and Discussion 

Again, it is possible that the results and discussion section may actually be two
different sections. If the results section is especially long, it may be easier to
present all the results of the analyses in one section and then explain and discuss
the results in a separate section. However, if the research has not required
comparisons of many different types (and all the accompanying tables), then the two
can be compressed in one section. Some journals, because of lack of space, ask 
that the sections be shortened and combined. 

When you are writing a proposal, these sections cannot be completed. It may be 
possible, on the basis of the proposed analysis, to speculate on possible outcomes 
and interpret what various findings would mean if they were obtained. This is 
good preparation for writing the final report. However, many grant agencies and
university faculty committees believe that this should be left until results are in,
since speculation ahead of time may lead to bias in interpretation. That is, if
you've already decided what the results (should they confirm or should they
disconfirm your hypotheses) mean, you may not be able to interpret accurately the
results that do obtain.

There are other differences between a proposal and the final report. In a final
report, the researcher has had the opportunity to think about other ways the
general research question might be addressed. Particularly in dissertations and
theses, researchers discuss other possible research projects that might contribute
to answering the general question. In fact, the discussion sections of dissertations
are good places to look for research ideas.


Front and Back Matter

In addition to the body, research reports must include material that publishers
call front matter (usually, a title page and abstract) and back matter (references,
appendices).


Title Page. Most proposals as well as final reports have a title page. The full title
is given (a shorter title may be used as a running title at the top or bottom of the
proposal itself). The researcher's name and affiliation are given on a separate line.
If a date is relevant, this may also be given. Frequently, a statement appears on 
the title page to identify the institution or organization to which the proposal is 
being submitted. 

The title should have three characteristics: it should be brief, it should be clear, 
and it should sell the project to the reader. Look at the following examples.

A description of the similarity of gender encoding errors of French and
Spanish students for quantifiers, adjectives, and pronouns and the form and
placement of object pronouns

Shorter version: French and Spanish gender: similarities in errors of foreign
language students. 

An investigatory analysis of several test approaches to discover which con¬ 
sistently measure language dominance in bilinguals 

Shorter title: Testing language dominance 

The title should be clear. The TESOL organization advises those who wish to 
submit papers to the conference to remember that the title will often be the only 
criterion a conference goer uses to decide which sessions to attend. Similarly, it 
may be the only criterion a reader will use to decide whether or not to read an 
article or a grant agency to decide whether or not to consider your proposal. On 
the other hand, the title does not necessarily have to be dull. It can interest the
reader as well. 

Each of the following titles is clear and short. Which would you rather read (or
which would you select for a conference paper)? 

The unitary language factor hypothesis in light of recent developments in
IRT

A final nail in the unitary language factor coffin: recent developments in 
IRT 

Some fields have had a period where clever titles (e.g., child sayings such as "Wait 
for Me, Roller Skate" for a study on directives, or puns such as "The Chicago 
Which Hunt" for a book containing papers on relative clause formation) were 
common. Can you imagine how difficult it would be to do a literature search 
with key words from such titles? If you are tempted to use a clever title, think 
first about how colleagues with similar interests will find your work. Second, if 
you are a graduate student, it might be wise to consult with your department to 
get their approval of titles prior to submitting your proposal.

Most dissertations and theses use the following order for the rest of the front 
matter: dedication, table of contents, table of figures, table of tables, acknowl¬ 
edgment page, CV, and the abstract. Notice that this particular dissertation form 
places the CV in the front matter while most grant proposals require that it be 
placed in the back matter. 


Abstract. Abstracts are exceedingly difficult to write. The first thing to do is to 
find out what restrictions there are on the number of words allowed for the ab¬ 
stract. Then, think about the most concise statement of your research question, 
how to identify the Ss and procedures as briefly as possible, how to summarize
the findings, and also show the relevance of the study.

Much information, of course, will not appear in the abstract. The reader (if the
abstract is interesting) will turn to the article to find this information. If the
article is organized in the usual format, the information will be easy to find. That's
one of many reasons why journals adopt a research format and ask contributors
to use it.

Consider the following abstract: 

C. Haas & J. R. Hayes. 1986. What did I just say? Reading problems
in writing with the machine. Research in the Teaching of English,
20, 1, 22-35. Sixteen computer-writers were informally
interviewed about how they used the computer for writing tasks.
While the writers felt that the computer was useful for their writing
tasks, they also indicated that writing with the computer had disadvantages.
They reported difficulty in locating information, detecting
errors, and difficulty in reading the texts critically. Three experimental
studies were conducted to compare the performance of college
students reading texts displayed on a computer terminal screen
and on a printed hard copy. Findings indicate that visual/spatial
factors influence locational recall, information retrieval, and appropriate
reordering of text. Copyright 1986 by the National Council
of Teachers of English. Reprinted by permission.

Comments on information missing in the abstract (information that you would
need to look for in reading the article): In what sections of the article would you
expect to find each piece of missing information?

1. Operational definition of terms (end of Introduction)

2. Information on the 16 Ss. Are the Ss the same in the informal interview and
in the three experimental studies? (Introduction and Method section, Subject
heading)

3. Research design (Introduction and Method, Procedures heading) 

4. Statistical procedures (Results section) 

5. Software program used (Introduction and Method, Procedures heading) 

6. Types of writing assignments (Introduction and Method, Procedures heading) 

Comments on the title: Catchy title, informative. 


ooooooooooooooooooooooooooooooooooooo 

Practice 4.8 

1. Comment on the following abstract: 

C. Daiute. 1986. Physical and cognitive factors in revising: insights
from studies with computers. Research in the Teaching of English,
20, 141-159. This article argues that writing involves the complex
interaction of parallel processes, in this case physical and cognitive
processes. Phase one of this study indicates that the writing instrument
can affect the writing process. Junior high students using a
computer word processor corrected more errors when they used the
word processor than when they used pen. The computer word 
processor, however, was not used for more expanded revising activ¬ 
ities. Phase two of the study contrasted the physical aids of the 
computer word processor with the direct cognitive aid of a revision 
prompt program. Ss who used the prompt program revised the drafts 
more closely and extensively than the students who used only the word 
processing program. 

What information is missing in the abstract? Where would you expect to find the
information in the article?


Comments on the title: 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Abstracts written for proposals differ from those written for journal articles. For 
one thing, they are usually longer. The abstract of a proposal will stress the
possible outcomes of the project while an abstract for a journal will emphasize 
final findings. 

Abstracts written for conference presentations may also differ from those of 
journals. Like a proposal abstract, they must "sell" the research to a review 
board. Writing such proposals is an art (or does it develop with practice?). It is 
always a good idea, when submitting an abstract for a conference or to a grant 
agency, to ask someone with experience to review your efforts and give you ad¬ 
vice. 

Journal abstracts are written for readers perhaps even more than they are for a 
review board. Most professionals have little time to read every article in journals 
relevant to their fields. Instead, they scan titles and from these select those that 
look promising. They read the abstract as a summary of the project. If the ab¬ 
stract promises important information relevant to their interests, then they will 
take the time to read the whole report. Well-written abstracts are extremely 
helpful to the profession. 


References. At the end of the body of the document, you will find a list of references
that the author has consulted. In a research proposal, and particularly for
dissertations and theses, these reference lists may be extremely broad. They may
even be annotated and grouped around topic areas. Such references demonstrate
that the researcher is well prepared to carry out the research. In journal articles
and most reports, the list will include only those references actually cited in the
article. The actual format used for references should be consistent. Journals (and
most university departments) specify the format to be used. It is extremely
helpful if the references in the proposal and those in the final report can be prepared
in exactly the same form.
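
For a journal article, where the list should contain only the references actually cited, a rough consistency check can be automated. The sketch below assumes citations of the simple form "(Author, Year)" with a single author; the pattern, the function name check_references, and the toy text and reference list are assumptions for illustration, and real manuscripts will need a pattern matched to the required citation format.

```python
import re


def check_references(body_text, reference_keys):
    """Compare (Author, Year) citations in the text with the reference list."""
    cited = set(re.findall(r"\(([A-Z][A-Za-z]+), (\d{4})\)", body_text))
    listed = set(reference_keys)
    return {
        "cited_not_listed": sorted(cited - listed),
        "listed_not_cited": sorted(listed - cited),
    }


# Toy text and reference list, invented for the example.
text = ("Sampling is described elsewhere (Shavelson, 1988). "
        "Idea units (Chafe, 1985) were coded.")
refs = [("Shavelson", "1988"), ("Flashner", "1987")]
print(check_references(text, refs))
```

In this toy case the check would flag one citation with no entry in the list and one entry that is never cited. Both problems are easier to fix before an editor finds them.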


Appendices. Appendices may be used to give information that, because of length,
is difficult to incorporate into running text. Typically, these include descriptions
of materials, measurement instruments, and special equipment when these are
original and published descriptions are not available to the reader. They may
also give additional information on procedures or information on the status of
Ss or schools used in the research. They almost always contain a sample release
form used to get permission from Ss, schools, parents, or authors to use the data
source.

In addition to these general appendices, you may decide to place all the tables 
and/or all the figures in an appendix. Conventions such as "Insert Table 1 here"
are used to mark the point at which the author intends the tables or figures to
be read.

In some cases, you may find a data appendix. More likely the text will include
a note that complete copies of the data can be obtained from the author, the
agency, the school district, or the university. The appendix may, however, show
sample data and the notation used in coding the data. Do not be surprised if an
editor decides to cut many of your appendices for lack of space.

Indices are seldom required unless the report or dissertation is to be published in
book form. This can be a horrendous or an easy task depending on whether or
not you have access to a computer program that allows you to create automatic
indices. Budgets, time tables, and researcher CVs are required for many proposals.
If they are required, the best place to put them is in a special appendix.
Appendices can also include details on availability, cost, and ordering of computer
programs, particularly if a program is exotic or written especially for the
study itself.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 4.9 

1. For the article you selected to review on page 110, check the reference section.
What format was used for the references? Are the references complete and is the
form consistent?

2. Critique the abstract of the article. Is it an informative summary of the article?
If the author were allowed another 25 words, what additional information
should be included?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


Conclusion for Part I 

This concludes part I, the section of this book which was written to help you plan 
a research project. We have defined research questions, discussed procedures for 
gathering data in terms of design, and the possible threats to internal and external
validity associated with designs. Once you have completed the following
practice, you will have completed the basics needed to write a project proposal. 
In part II, we will turn to initial ways of describing the data, the findings that are 
offered as evidence related to the research questions. 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 4.10 

1. Select one of your own research questions and carry out each of the following 
activities. 

a. Select a title for one of your own research questions. Ask members of 
your study group to critique it. What suggestions did they make for im¬ 
provement? 

b. Write a few opening sentences for the research proposal (or report). If 
necessary, look at the samples given in this chapter on page 109. 

c. Describe the data source (Ss, texts, or whatever) and the techniques
used for data sampling. 

d. Write a description of materials to be used in the study. Ask your 
study group members to critique the description for replicability. 

e. Write out a procedures statement. Again, ask your study group mem¬ 
bers to critique the statement for you in terms of clarity and replicability. 

f. Write an abstract for the research proposal. Ask members of your 
study group to critique it. What changes did they recommend? Did you 
agree? Why (not)?

g. Describe the requirements regarding back matter (e.g., appendices,
budgets, CVs) and front matter (e.g., title, abstract, table of contents) for
your institution.

h. Draw a box diagram for your study in the space provided. Then fill in
as many items as possible in the following chart. (You are not yet prepared
to fill in the entire chart.)


Research hypothesis? _ 

Significance level? _ 

1- or 2-tailed? _

Design 

Dependent Variable (s)? _ 

Measurement? _ 

Independent Variable(s)? _

Measurement? _ 

Independent or Repeated Measures? 

Other Features? _ 

Statistical Procedure? _ 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Activities 

1. S. Jacoby (1987. References to other researchers in literary research articles.
English Language Research Journal, 1, University of Birmingham, Birmingham,
England) investigated the ways in which researchers showed themselves to be the 
"humble receivers of tradition" and also "irreverent pioneers" who argue for the 
originality and rightness of their research. Being a receiver of tradition means
that researchers build on the work of those who have gone before them.
Researchers must show how their work relates to this tradition. Previous work is
cited and traditional arguments credited. At the same time, as the irreverent
pioneer, the writer must also argue against tradition, claiming new territory and
new directions for research so as to establish the originality of his or her research. 
Jacoby's article considers how literary researchers accomplish this task. Outline 
a research project which would investigate this same issue using research articles 
from applied linguistics rather than literary research. (What other "English for 
special purposes" research topics can you propose that relate to how researchers 
accomplish report writing?)

2. J. K. Burgoon & J. L. Hale (1988. Nonverbal expectancy violations: elaboration
and application to immediacy behaviors. Communication Monographs,
55, 58-79.) carried out a study to test the notion that the road to success lies in
conformity to social norms. They found there are times when violation of norms
has a favorable as opposed to a detrimental consequence. The introductory section
of this article is actually longer than the method, results, and discussion.
This often happens in research articles. Read the article and explain why this is 
so for this study. 

3. M. Tarantino (1988. Italian in-field EST users self-assess their macro- and
microlevel needs: a case study. English for Special Purposes, 7, 33-53.) gathered
information on the linguistic problems of learners who use English in their
profession (professors and researchers in different fields of physics, chemistry, and
computer science, all of whom were collaborating in international projects). The
questionnaire had three aims: (1) to classify the Ss on demographic variables, (2)
to obtain information on experience which may have influenced language
proficiency, and (3) to investigate their perceived competence in all skills in relation
to their EST (English for science and technology) needs. Examine the
questionnaire given in the appendix, grouping the items according to these goals. How
might one set up a replication study for persons enrolling in an EST course
(students who hope to become professors or researchers on such projects)? What
other questions might you want to ask?

4. C. Kessler & M. E. Quinn (1987. Language minority children's linguistic and
cognitive creativity. Journal of Multilingual and Multicultural Development, 8, 1 
& 2, 173-186.) compared the performance of children in two intact classrooms: 
one group of monolingual and one group of bilingual sixth graders. The bilingual 
children outperformed the monolingual children in the quality of hypotheses they 
generated in problem-solving tasks. Their use of complex metaphoric language 
was also rated higher. Review the sections on Subjects, Procedures, and Measures.
Given the descriptions, could you replicate the study with other groups of
monolingual and bilingual children? If you carried out a replication, what part(s) 
of the study would you keep and which would you change? Why? 

5. J. Lallerman (1987. A relation between acculturation and second-language 
acquisition in the classroom: a study of Turkish immigrant children born in the 
Netherlands. Journal of Multilingual and Multicultural Development, 8, 5, 
409-431.) rank-ordered children on their acquisition of Dutch and also on 
acculturation in order to test the relationship hypothesized by Schumann. Since 
the rank order assignment of each child on acculturation and on acquisition of 
Dutch is based on several measures, the section on data collection is relatively 
long. Read through the descriptions of the measures and the method of combining
several measures to achieve the rank order on each variable. Based on the
descriptions, could you replicate this study with another immigrant group? What
parts of the study would remain unchanged in the replication study and what
parts would you elect to change? Why? 

6. B. A. Lewin (1987. Attitudinal aspects of immigrants' choice of home language.
Journal of Multilingual and Multicultural Development, 8, 4, 361-378.)
measured the attitudes of English-speaking immigrants in Israel in relation to
their choice of L1 or L2 for communication with their Israeli-born children. Since
this research was conducted with a 20-parent sample, the researcher is careful to
mention this as a limitation of the study. Note the limitations the author points
out and consider how you might replicate this study to overcome the limitations.
Notice the placement of the limitations section in the report. Would you argue
for placing it before the method section or in the conclusion? Why?

7. A. Doyle, J. Beaudet, & F. Aboud (1988. Developmental patterns in the flexibility
of children's ethnic attitudes. Journal of Cross-cultural Psychology, 19, 1,
3-18.) were faced with the problem of finding measures appropriate for testing
ethnic attitudes of young children (kindergarten to sixth grade). Read the
descriptions of the procedures used with the Evaluative Attribution Task (which
was created for this study based on the Preschool Racial Attitudes Measure II
and the Sex Role Learning Inventory), the Crandall Social Desirability Scale, and
Aboud's Ethnic Constancy Test. Would these be appropriate procedures if you
replicated the study with another group of children? What parts of the study
would you keep and which might you change in a replication study? Why?

8. The journal English World-Wide publishes summaries of theses on varieties
of English in their fall issue each year. Read the abstracts provided in the 1987
issue (8, 2, 277-299). The abstracts are very brief. Select one abstract and list
missing information and where you would expect to find it in the actual thesis.

9. Dissertation Abstracts has several headings related to applied linguistics.
Check the most recent issue and review one abstract that relates to your own
research interests. The abstracts, here, are longer and so give the reader more
information. What information do you still need to find in the dissertation? Where
in the dissertation do you expect to find the information?


Part II. Describing Data





Chapter 5 

Coding and Displaying 
Frequency Data 


• Coding and counting values
    Frequency, cumulative frequency
    Relative frequency: percent, proportion, cumulative percent
    Rates and ratios
    Coding numerical data
• Data displays
    Bar graphs and histograms
    Pie charts
    Frequency polygons and line drawings


Coding and Counting Values 

In reading research reports, you will find many ways in which variables have
been coded: as T-units, type/token ratios, MLU (an utterance length measurement),
test scores, and so forth. Each of these is a simple way of giving numerical
value to a variable. These values are seldom, if ever, displayed as a list
of numbers. Rather, the values are condensed in some way to make them
meaningful. If we want to know how many men and how many women took a
particular test, we don't want to see a list of names with an M or an F after each.
Rather, we want a total for the frequency of each. If we want to know about
flexibility or variety of vocabulary choice in the writing of thirty students, we
don't usually want to see a list of student names followed by numbers for total
number of words and total number of unique words for each. Names and numbers
can't give us information at a glance. We want the information presented
in a way that allows us to understand it quickly. In this chapter, we will consider
the various ways frequency data can be computed and displayed to give us the
maximum amount of information in a descriptive display.


Frequency 

You will remember that nominal variables tell us how often (how frequently)
something occurs. The data are frequencies. In text analysis we are concerned
with how often something occurs and with comparing the frequency of occurrence
across genres, or in spoken vs. written text, or according to formality register.
Imagine that you, like us, have an interest in the language used by children in
elementary school classrooms. (Research that is similar to that described in the
remainder of this chapter was part of our research agenda at CLEAR, the Center for
Language Education and Research, a center funded by the Office of Educational
Research and Instruction. Many of the questions posed and some of the data
samples come from this project. Preliminary findings from the project are
presented in Flashner, 1987.) Video and audio recordings in the classroom have
yielded a wealth of oral language data. The children's written work and journal
entries comprise the written language data. All the language data, both oral and
written, have been transcribed and entered into computer files.




Some of the students in the class are native speakers of English, some are fluent
bilinguals (Spanish and English), and some are classified as limited in English
proficiency (LEP). Descriptions of numbers of subjects are usually reported as
n, where the n stands for the number of Ss. You may, however, sometimes find
it shown with an f since that is the symbol for "frequency." The fictitious
frequencies for each group are:


Classification 
English Only 

Fluent English Proficiency 
Limited English Proficiency 


Other frequencies might be used in describing these Ss as well, such as sex, native
language, and so forth.


The first thing you might want to discover is just how much language (perhaps
how many words or clauses or utterances) is produced by the children. The
design box can be labeled to show how the frequencies for number of words
would be divided for the variables of language classification (of the Ss) and
language mode.



Do you think it might be important to separate the data of boys and girls? If so,
we could amend the design to a three-dimensional box to allow for this comparison.

The easiest way to count the number of words is to have the computer do the
counting. Once the data are entered into the computer, you can ask it to count
the number of words by reading the spaces between words. If the data are
entered in the computer so that the data source (the S and the S's classification) and
the mode (written or oral) are tagged, the computer can easily tally the word
frequencies for the cells of the above design. The computer could give an
automatic reading for the number of clauses if each clause were entered on a separate
line and the program can count the number of lines. However, automatic reading
is not always possible. When you work with natural language data, you may
need to type in and, in addition, code the data before the computer can tally
frequencies for you. For example, if you wanted to know how many turns each of
the students took in various types of classroom tasks, you would need to code the
turns and tasks so that the computer can find and count them. Or, if you wanted
to know how many verbs or how many nouns or how many adjectives occurred
in the talk of the limited English speakers, you would need to code the data so
that the computer can find and count them.
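
The word tally described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (student, classification, mode, text) and the sample utterances are invented here and are not the project's actual file format.

```python
from collections import Counter

# Hypothetical tagged records: (student id, classification, mode, text).
records = [
    ("S01", "EO", "oral", "I think it goes here"),
    ("S01", "EO", "written", "We made a rubber band car"),
    ("S02", "LEP", "oral", "this one is big"),
    ("S03", "FEP", "written", "My dog likes to run"),
]

def tally_words(records):
    """Sum word counts for each (classification, mode) cell of the design."""
    cells = Counter()
    for _student, classification, mode, text in records:
        # split() with no argument breaks the text on any run of whitespace,
        # which is the "reading the spaces between words" idea.
        cells[(classification, mode)] += len(text.split())
    return cells

cell_totals = tally_words(records)
for cell, n_words in sorted(cell_totals.items()):
    print(cell, n_words)
```

Counting by spaces treats contractions and hyphenated forms as single words, so the definition of "word" should be settled before the counts are interpreted.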

If your research question has to do with the types of clauses children produce,
there is no way that the computer can automatically find clause types in the data.
For each clause, you would need to enter a clause-type tag and the computer
would use the tag to count frequencies for each type. If punctuation marks are
used at the ends of utterances, the computer, given program instructions, could
give an automatic frequency count for number of utterances. If, however, you
wanted a tally of the speech-act function of the sentences or utterances in the
data, you would again need to add this information using a tag so that
instructions can be given to the computer that will allow it to do the tally.
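
As a rough sketch of tag-based counting, assume (hypothetically) that each clause sits on its own line and that the hand-entered clause-type tag comes first, separated from the clause by a tab. Neither the layout nor the sample clauses come from the original data.

```python
from collections import Counter

# Hypothetical transcript: one clause per line, tag first, then a tab.
transcript = """\
transitive\tI found the ball
question\twhere is it
transitive\tshe took my pencil
imperative\tgive it back
"""

def count_clause_types(text):
    """Tally how many clauses carry each hand-entered clause-type tag."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, _clause = line.split("\t", 1)
        counts[tag] += 1
    return counts

clause_counts = count_clause_types(transcript)
n_clauses = sum(clause_counts.values())  # one clause per line
print(clause_counts, n_clauses)
```

The same approach would serve for speech-act tags; only the tag vocabulary changes.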

There are a number of computer programs available to analyze texts. Typically,
they provide a word list in alphabetical order and a concordance. Each entry for
a specified word is displayed, surrounded by the words of the context in which it
occurred (along with information as to where in the text the word appeared). The
programs also provide various statistics related to frequency of occurrence. If you
use a computer concordance program such as WordCruncher (1988) or the
Oxford Concordance Program (1980), much information can be found and tallied
with little difficulty. Certain kinds of information, however, may not be so easily
matched and retrieved.
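
To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch. It is not how WordCruncher or the Oxford Concordance Program work internally; the function name, the tokenizing rule, and the sample sentence are all assumptions made for illustration.

```python
import re

def concordance(text, keyword, width=3):
    """Return (word position, context) pairs for each occurrence of keyword."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if word == keyword:
            left = words[max(0, i - width):i]
            right = words[i + 1:i + 1 + width]
            # The keyword is upper-cased so it stands out in its context.
            hits.append((i, " ".join(left + [word.upper()] + right)))
    return hits

sample = "You can take it if you can find it. I can not see it."
hits = concordance(sample, "can", width=2)
for position, context in hits:
    print(position, context)
```

The number of hits is the frequency of the word, and the position tells where in the text each instance appeared.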

If there is a way not to code, much time will be saved. If you wanted to count
the number of turns and the transcript has been entered with the speaker's name
at the beginning of each utterance, you will not need to code turns because the
computer can count the number of times each person's name appears in the
transcript. If you were doing a study of modals, it would be less time-consuming
to ask the computer to find and tally the frequency of can, may, might, will, and
so forth, rather than going through the data and marking each with a modal code
and then using the code to pull up the examples.
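
Both shortcuts can be sketched without any hand coding. The transcript layout (speaker name, colon, utterance) and the speaker names are hypothetical, and the modal list holds only the four forms named in the text.

```python
import re
from collections import Counter

# Hypothetical transcript: speaker name and a colon begin each utterance.
transcript = """\
Maria: can I have the tape
Jose: you might need scissors
Maria: Jose will get them
Teacher: okay, you may start
"""

MODALS = {"can", "may", "might", "will"}

def count_turns(text):
    """Count turns by the speaker label at the start of each line."""
    turns = Counter()
    for line in text.splitlines():
        match = re.match(r"(\w+):", line)
        if match:
            turns[match.group(1)] += 1
    return turns

def count_modals(text):
    """Count modal word forms in the utterances, ignoring speaker labels."""
    counts = Counter()
    for line in text.splitlines():
        _speaker, _, utterance = line.partition(":")
        for word in re.findall(r"[a-z']+", utterance.lower()):
            if word in MODALS:
                counts[word] += 1
    return counts

turns = count_turns(transcript)
modals = count_modals(transcript)
print(turns, modals)
```

A plain word match cannot tell the modal "can" from the noun "can," so the retrieved examples still need to be read before the tallies are trusted.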

Once the data are coded, the computer will tally and give you the total number
of observations. This total is abbreviated as n. If the total is subdivided into
categories, then N is used for the grand total and n is used to denote the number
of observations in each category. Some researchers prefer to use the symbols for
frequency, f and F, instead.


ooooooooooooooooooooooooooooooooooooo 

Practice 5.1 

1. Assume the computer gave you frequency information for numbers of words
and frequency of each clause type for the monolinguals, the fluent bilinguals, and 
the LEP children in the study on page 130. Precisely what would be shown in 
each of the cells? 


What would this tell you? How useful is such information? 


What figures do you feel would be more important? Why? 


2. If you wanted to do a study of frequency adverbs in the speech and writing 
of these fourth-grade Ss, how might you get the computer to do the counting, 
without codes?___________ 


3. If you had entered natural language data of Ss and wanted a tally of number
of turns for each speaker, how might you exclude names mentioned during talk
from this count?

4. Look at the variables you have identified for your own research questions in
previous chapters. Which of these will be coded with frequency counts?


Can the counts be automatically accomplished, or must the data first be coded 
so that you or the computer can find the instances to be counted? 


It hardly makes sense to enter data into the computer if it takes as long or longer 
than it takes to tally the data by hand. How will you decide whether to do the 
counts by hand or with the computer? 


ooooooooooooooooooooooooooooooooooooo

Cumulative frequency 

Cumulative frequency figures are presented when we wish to show a group of
frequencies and the overall contribution of each to a total frequency. For
example, imagine that the data base for one instructional unit in Flashner's work
contained 1,293 clauses produced by the fourth-grade students. The data might
be presented as follows:


Clause Type          Freq (f)    Cum Fr

transitive              587       1293
pred nominative         197        706
pred adjective          155        509
question                 86        354
possessive               81        268
locative                 74        187
intransitive             64        113
purpose                  18         49
existential              13         31
passive                  10         18   <- 8 + 10
imperative                8          8


To obtain the cumulative frequency, the data are arranged in order of frequency
(the most frequent at the top of the list). The least frequent clause type is
imperative with an f of 8. The next clause type is passive with an f of 10. The
cumulative frequency at this point is 18 (8 + 10, as marked in the table above).
As each clause type is added, the cumulative frequency rises until the grand total
F (1293) is reached. A formula for cumulative frequency, then, is:

cum f = successive additions of frequency
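
The successive additions can be checked with a short Python sketch that uses the clause frequencies from the table above. The function name and the dictionary layout are illustrative choices, not part of the original study.

```python
from itertools import accumulate

# Clause-type frequencies from the table above.
clause_freqs = {
    "transitive": 587, "pred nominative": 197, "pred adjective": 155,
    "question": 86, "possessive": 81, "locative": 74, "intransitive": 64,
    "purpose": 18, "existential": 13, "passive": 10, "imperative": 8,
}

def cumulative_table(freqs):
    """Return (category, f, cum f) rows, most frequent category first."""
    # Work upward from the least frequent category, keeping a running sum.
    ordered = sorted(freqs.items(), key=lambda item: item[1])
    running = accumulate(f for _, f in ordered)
    rows = [(name, f, cum) for (name, f), cum in zip(ordered, running)]
    return rows[::-1]

table = cumulative_table(clause_freqs)
for name, f, cum in table:
    print(f"{name:<16}{f:>6}{cum:>8}")
```

The top row's cumulative frequency should equal the grand total F, which gives a quick check that no category was dropped.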


ooooooooooooooooooooooooooooooooooooo

Practice 5.2 

► 1. Show the frequency and cumulative frequencies for clause connectors in a
data set containing a total of 490 connectors. The frequencies of each are as
follows: after 4, and 72, as 1, because 50, but 15, how 14, if 28, like 11, okay 12, or
12, introductory preposition 52, so 3, that 42, to 89, well 5, what 10, when 50,
where 13, who 9, why 1.


Connector f cum f 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

If you were interested in the differences in oral and written data, you would be
less interested in the overall frequency of these clause connectors than in how the
distribution compares in oral and written modes.

Another possibility for study of classroom discourse analysis would be to code the
clauses according to what Heath (1986) calls classroom genres: kinds of language
patterns shown in school-based behaviors. Flashner used Heath's genre categories
for teacher-student interactions and Phillips' (1983) categories for
student-student interactions. A brief definition of each is given below.

Term               Definition

label quest        student response to teacher's question for name or attribute
                   (e.g., "What is this called?" "A rubber band.")

meaning quest      student response to teacher's question re meanings (e.g., "Does
                   anybody know what an interview is?" "Uhh, when you talk to
                   somebody and ask questions.")

account            student gives information unknown to hearer or reader (e.g.,
                   show-and-tell or a personal diary entry)

event cast         student talks about an act or event in progress or as it will
                   happen

recount            student retells information already known to the listener

student question   student asks for information


Heath believes that label quests, meaning quests, and student questions are the 
three genres basic to classroom interaction. The other genres, in contrast, are 
those that require higher order manipulation and integration of information. 
Thus, it would be possible to group the clauses as having either higher or lower 
cognitive demand. 


Phillips' genres for student-student interaction include the following (among
others):


Term Definition 

argument student adopts alternative point of view 

exposition student acts like teacher in structuring a conversation (e.g., "Okay, 
yard slips, how many yard slips are there?") 

hypothesis student offers speculation or suggestion (e.g., "I think the best thing 
is to take out 'computer.'") 

operation student offers a running commentary on activity 


ooooooooooooooooooooooooooooooooooooo 

Practice 5.3 

Imagine that the total for clauses in Flashner's data set was 1,153. Here are the
frequencies for clauses in each of the above genres: account 476, event cast 348,
label quest 112, meaning quest 11, recount 22, student quest 18, argument 12,
exposition 17, hypothesis 16, operation 121.

► 1. Prepare a frequency and cumulative frequency table for the data:

Genre f cum f 


2. Total the clauses that reflect Heath's basic interaction and those requiring
higher order manipulation or integration of information. Does there seem to be
much variability in the frequency of each clause type within each of the two
categories? _


3. In which genre(s) do the children produce the fewest clauses? How would you
account for this? _


4. In which do they produce the most clauses? How do you account for this? 


ooooooooooooooooooooooooooooooooooooo 

Relative Frequency 
Percent and Proportion 

When you have only a small amount of data, it makes sense to present the raw 
frequency totals. It is easy to see how important each frequency is relative to the 
rest of the data. However, when numbers are large and/or when there are many 
categories, it is often more informative to show the relative frequency of each
category as proportion or percent.


Proportion = number of X / total

Percentage = 100 x proportion


Imagine that you have entered all the data from the fourth-grade class and
tagged the types of clauses used by the children. You wonder how the kinds of
clauses children use differ in oral and written modes. Ultimately, you want to
know whether the language produced in these two modes shows the same kinds
of differences in clause structure as found for adults in other studies. To make
these comparisons with previous studies, you might count and display the
distribution of the same clause types as those reported in the research on adults. The
computer can tally these as raw frequencies, but it can also show you a percent
for each clause type in all the written clauses counted. It can do the same thing
for the oral language sample. For example, in coding the data for purpose
clauses, adverbial clauses, and so forth, the raw frequencies look like this:


Clause Type               Oral       %     Written       %

Complex                     44      6.5       151      32.2
Complex/Coordinate           2      0.5         4       0.8
Coordinate                  27      4.0        16       3.4
Fragment                   194     28.0         7       1.5
Simple Finite              352     51.5       257      54.8
Simple Nonfinite            51      7.5        33       7.0
Nonmatrix Subordinate       14      2.0         1       0.3

Total                      684    100%        469     100%


(Some of the percentages have been rounded to total 100%.) Note that we
calculated the total number of oral clauses and then calculated the percent of that
total for each clause type (and the same thing for written). You could also
compute the percent of oral vs. written clauses for each clause type. The direction
depends on the research question. If you want to know how many of the complex
clauses in the data appear in written vs. oral data, the number of fragments that
occur in written vs. oral data, and so forth, then the second method makes sense.
If you want to know about the overall distribution of the clauses within oral data
and then within written data, you would select the first method.
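
The two directions of calculation can be compared in a short sketch that uses the raw frequencies from the clause table. The variable names are illustrative, and because the printed percents were adjusted to total 100%, a computed value may differ from the table by a few tenths.

```python
# Raw frequencies from the clause table: (oral, written).
counts = {
    "Complex": (44, 151),
    "Complex/Coordinate": (2, 4),
    "Coordinate": (27, 16),
    "Fragment": (194, 7),
    "Simple Finite": (352, 257),
    "Simple Nonfinite": (51, 33),
    "Nonmatrix Subordinate": (14, 1),
}

def percent(part, total):
    return 100 * part / total

oral_total = sum(oral for oral, _ in counts.values())
written_total = sum(written for _, written in counts.values())

# First method: each clause type as a percent of all clauses in its mode.
within_mode = {
    name: (percent(o, oral_total), percent(w, written_total))
    for name, (o, w) in counts.items()
}

# Second method: oral vs. written share of each clause type.
across_modes = {
    name: (percent(o, o + w), percent(w, o + w))
    for name, (o, w) in counts.items()
}

for name in counts:
    print(name, [round(p, 1) for p in within_mode[name]],
          [round(p, 1) for p in across_modes[name]])
```

Complex clauses, for example, make up about a third of the written clauses under the first method, while under the second method roughly three quarters of all complex clauses turn out to be written.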

Another common use of percent or proportion in second language research is the
use of "percent correct in obligatory instances." In such studies, a feature (say,
the -s inflection of third person present tense) is to be evaluated for correctness
in natural language data (oral or written). Rather than tally the number of -s
inflections produced in a piece of text by the learner, the researcher first goes
through the text and identifies all the places where the -s inflection would be used
by a native speaker of the language. These are the "obligatory instances." Then
the S is given a score for the correct uses of -s, expressed as a percent of the total
number of obligatory instances in the text. Simple open-ended tallies of features
can be changed to closed-group percentages in this way and comparisons can
then be drawn across learners or groups of learners. That is, frequencies have
been changed to percentages, moving measurement from nominal to continuous
data.

Percents and proportions are exactly the same except that percents are presented
as whole numbers (e.g., 68%) while proportions are presented as decimals (e.g.,
.68). It is important to give the reader a feeling for the frequencies as well as the
percent or proportion. That is, a figure of 80% could represent 4 instances out
of 5 or it could represent 80 out of 100. To overcome this difficulty, you might
place the frequency data in the table and add the percent or proportion in
parentheses next to each figure. Another, perhaps more common, solution is to
present the percent or proportion and then note the n, the total number of
observations on which the percent is based.
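
A small sketch shows percent correct in obligatory instances reported together with its n. The function names and the report format are assumptions for illustration; the two cases are the 4-out-of-5 and 80-out-of-100 figures mentioned above.

```python
def percent_correct(correct, obligatory):
    """Correct uses as a percent of obligatory instances."""
    if obligatory == 0:
        raise ValueError("no obligatory instances in the text")
    return 100 * correct / obligatory

def report(correct, obligatory):
    """Show the percent together with the n it is based on."""
    return f"{percent_correct(correct, obligatory):.0f}% (n = {obligatory})"

small_sample = report(4, 5)
large_sample = report(80, 100)
print(small_sample)
print(large_sample)
```

Both learners score 80%, but the n beside each figure lets the reader see that one score rests on 5 instances and the other on 100.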


Cumulative Percent 

Just as tables sometimes include a column showing cumulative frequencies, tables
may include a column for cumulative percent. Cumulative percent is computed
in precisely the same way as cumulative frequency.

cum f = successive additions of frequency

cum % = successive additions of percent

We could arrange the data for clauses in the oral mode like this:


Oral Mode

Clause                    %       Cum %

Simple Finite           51.5%    100.0%
Fragment                28.0%     48.5%
Simple Nonfinite         7.5%     20.5%
Complex                  6.5%     13.0%
Coordinate               4.0%      6.5%
Nonmatrix Subord.        2.0%      2.5%
Complex/Coordinate       0.5%      0.5%


Interpreting Percent 

One would think that the interpretation of percent would always be
straightforward. Indeed, that is usually the case. However, there are a number of cautions
to keep in mind. First, we need to know how large the n size is in addition to
knowing the percent. If the data on clause types were subdivided for the three
subject levels (EO, English only; FEP, fluent bilinguals; and LEP), the form of
the table would change.


             Oral                            Written

   EO    FEP    LEP    TOTAL        EO    FEP    LEP    TOTAL

It would now be even more important to show raw frequencies as well as
proportions if the data were presented in this way (rather than collapsed for all Ss)
because the frequencies might be small. You will notice that the total number
of clauses for this instruction unit was fairly large to begin with, but when it is
divided into frequencies for each group, the n size drops. It is unlikely that equal
numbers of clauses would be produced by the children in each of the three
groups, but with the small n, the proportions might look very different while the
actual raw frequencies might be quite close. For example, there are only 2
complex/coordinate clauses in the oral data. If both were produced by EO
children, then the percentages would be 100%, 0%, and 0%. Somehow 100% and
0% seem to differ more than the raw numbers 2 and 0.

There is a second problem in interpreting percents. Notice there were only 14
nonmatrix subordinate clauses in the oral data. If 10 were produced by children
in the LEP group, 4 by the EO group, and 0 by the FEP group, the percents
would be 71%, 29%, and 0%. This might seem very informative until we
remember that the number of EO, LEP, and FEP students is not equal to begin
with. If the size of the groups was unequal to begin with, we should not expect
equal numbers of clauses from each.
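
The caution can be illustrated with the 14 nonmatrix subordinate clauses. The clause counts come from the example above, but the group sizes below are hypothetical, chosen only to show how unequal groups can account for unequal shares.

```python
def shares(counts):
    """Each group's clauses as a rounded percent of the total."""
    total = sum(counts.values())
    return {group: round(100 * n / total) for group, n in counts.items()}

# Clause counts from the example in the text.
nonmatrix = {"LEP": 10, "EO": 4, "FEP": 0}
group_shares = shares(nonmatrix)

# Hypothetical group sizes (number of students); not from the study.
group_sizes = {"LEP": 10, "EO": 4, "FEP": 6}
per_student = {g: nonmatrix[g] / group_sizes[g] for g in nonmatrix}

print(group_shares)
print(per_student)
```

With these invented group sizes, the LEP and EO children produce the same number of such clauses per student even though their percents look very different.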

A third problem has to do with the amount of data available for interpretation.
If it is the case that children in the LEP group contribute less data to the total
data base, should this concern us? That is, if they produced more data, is it likely
that the proportion of clause types in their data would change? This is a matter
of real concern when we turn to interpretation.

It is also unlikely that the number of clauses produced in written and in oral
modes would be exactly the same. Again, this should cause concern. That is, if
the number of clauses in each mode were equal, would the proportion of clause
types change for each mode? One way to solve this problem would be to select
randomly 1,000 clauses from all of the oral data and 1,000 clauses from all of the
written data and use these data for the calculations.
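
Random selection of equal-sized samples might be sketched as follows. The function name, the seed, and the placeholder clause lists are assumptions for illustration, and the sketch presumes each mode holds at least 1,000 clauses.

```python
import random

def equal_samples(oral_clauses, written_clauses, k=1000, seed=0):
    """Draw k clauses at random, without replacement, from each mode."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(oral_clauses, k), rng.sample(written_clauses, k)

# Hypothetical clause pools standing in for the transcribed data.
oral_pool = [f"oral clause {i}" for i in range(2500)]
written_pool = [f"written clause {i}" for i in range(1800)]

oral_sample, written_sample = equal_samples(oral_pool, written_pool)
print(len(oral_sample), len(written_sample))
```

Any clause type that occurs only a handful of times in the full pool may be missing from a sample drawn this way.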

We do, of course, want to overcome as many threats to reliability and validity as
possible. Equating the sample size and using random selection are usually good
ways to do this. However, you can see that equating the number of clauses for
each cell of the design might be problematic if we also want to search for
infrequent structures. The possible threat to the study (because of unequal ns for Ss
or for clauses in the written and oral categories) has to be weighed against the
possibility that dealing with the threat means eliminating data you need. To deal
with the threat, we are trying to change open-ended frequency data into closed
continuous data (for example, the number of purpose clauses in a closed set of
1,000 randomly selected clauses). This conversion may not be warranted,
particularly if it endangers the study by deleting the few clauses of certain interesting
types that may appear somewhere in the data (but not in the 1,000 randomly
selected clauses).

ooooooooooooooooooooooooooooooooooooo


Practice 5.4 

► 1. Using the percent figures given for the written data on page 137, calculate
the cumulative percents and fill in the following table.

Written Mode

Clause % Cum %


ooooooooooooooooooooooooooooooooooooo 

Rates and Ratios 

Relative frequencies are most often presented as a proportion or percentage of a 
total. However, sometimes, the frequency is to be compared relative to some
number other than the total number of observations. The most common of these 
comparisons are rates and ratios. 

Rate is used to show how often something occurs in large data sets. 

Rate = relative frequency per unit 

It may not be that what you want to discover is "one in a million," but rather "one
in ten" or "one in a hundred." The first thing to consider is the unit, the "per X"
unit. For example, students whose first language is not English are usually given
a language screening test on entering the school system. The students are then
classified as English dominant (limited L1 skills), fluent bilingual (equal L1/L2
skills), limited English proficiency (limited L2 skills), or non-English-speaking (no
L2 skills). These classifications are used to determine the best curriculum for
individual children with differing L1 and L2 skills. The information from such a
screening in a large school district might be reported as in the following fictitious
data:

L1 Group          N       Class           Rate/100

Spanish         4,750     EO                 23
                          Fluent L1/L2       38
                          LEP                22
                          NEP                17

Korean          2,688     EO                 35
                          Fluent L1/L2       50
                          LEP                 3
                          NEP                12

Vietnamese      1,200     EO                  5
                          Fluent L1/L2       53
                          LEP                 8
                          NEP                34


In a smaller school district, of course, a unit of per 100 would not make sense. 

Rate might be useful in our fourth-grade study if we were interested in the
frequency of clause types that are thought to reflect higher cognitive skills.
Hypotheticals (e.g., if/then, if not/then, unless/then) and causals (e.g., because) are
examples of such clause types. Assume the clauses have been tagged in the data
so that the computer can tally how often these clauses occur. Do you imagine
they might occur "once per 10 clauses?" If so, then 10 is a good unit to select.
A unit of 100 or 1,000 clauses might not seem suitable.

While there is no standard unit, many text studies use "per 1,000 words." While
"per 1,000 words" appears in most published articles on text analysis, this does
not mean you must accept this as a standard for this study. You could use per
100 words, per 100 clauses, or per 1,000 clauses. The decision depends on the
nature of the data. If you are investigating clause types, then the unit should be
"per X clauses". However, if the data are not entered as clause units, you may
be forced to use "per word."
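
The effect of the unit can be seen in a short sketch of the rate formula. The function name and the counts (60 tagged clauses in a set of 1,200) are hypothetical.

```python
def rate_per(count, total, unit=100):
    """Relative frequency expressed per `unit` observations."""
    return unit * count / total

# Hypothetical counts: 60 tagged clauses in a data set of 1,200 clauses.
per_100 = rate_per(60, 1200, unit=100)
per_1000 = rate_per(60, 1200, unit=1000)
print(per_100, per_1000)
```

The same data read as 5 per 100 clauses or 50 per 1,000 clauses; the unit changes how the rate is stated, not the finding itself.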

Assume that we found that the rate for most of these "cognitively complex"
clauses was 5 per 100 clauses. Also assume that few of these clauses were used
by LEP children. Still, it is possible that it is not L2 proficiency but rather
classroom task that influenced the frequency of such clauses in their data. That
is, all the students may have the ability to produce these cognitively complex
structures, but may not have had the opportunity to do so. To check to see if this
is the case, we might work out an activity we think would encourage the use of
such clauses. The activity selected is a board game similar to "Johnny, can I cross
the ocean?" where permission is given or denied using such phrases as "yes, if
xxx," "not unless xxx," or "no, because xxx." With this task, the number of such
structures is dramatically increased for all Ss. With such a mini-experiment
introduced into a natural school setting, you can substantiate that all groups of Ss
are able to employ such structures appropriately but that they elect not to in
certain classroom tasks.

Imagine, too, that you are intrigued by the students' abilities to produce
cognitively complex clauses in this mini-experiment. Such clauses are considered
very difficult. In fact, they are believed to be typical of the language of "cognitive
academic language proficiency" (Cummins' CALP, 1984), the language needed
for academic success. The mini-experiment took place in what would normally
be thought of as "basic interpersonal communication" (Cummins' BICS) rather
than in an academic task. In order to substantiate the notion that more of these
clauses occur in academic tasks than in basic social interaction, you could do a
survey of text structures in science, math, and social studies reading materials for
the fourth grade. These counts could be compared with those of books used for
nonacademic pleasure reading (such as comic books or children's novels). On the
production side, you might compare counts of such clauses in children's diaries
and letters as compared with their written reports in language arts, science, and
social science. If you obtained the rate of cognitively complex clauses in each of
these situations, you could compare rates in academic and personal
communication tasks. The table on the findings might (or might not) look like the
following fictitious data:

Frequency of Cognitively Complex Clauses 

Acad Rdg     Rate     Non-Acad Rdg     Rate 
432          4:100    210              2:100 

Acad Writ    Rate     Non-Acad Writ    Rate 
121          1:100    110              1:100 


From this table (if the data were real), you could say that the frequency of 
cognitively complex clauses (as shown in the rates) for academic reading was 
twice that of nonacademic reading. For the novice writers (the children), you 
could state that the rates for frequency of cognitively complex clauses were the 
same in academic and nonacademic writing. The project breaks down when we 
try to compare the rates in the other direction. The notion that children produce 
fewer clauses of high cognitive complexity when compared to adults who are 
professional writers should not surprise anyone. If we try to talk about the 
cognitive demand of children's reading compared with children's writing, then the 
measure of cognitive demand (and our operational definition of cognitive 
demand) would be the production of such clauses. "Cognitive demand" would not 
include the cognitive demands of the composing process in writing vs. 
comprehension processes in reading, a very knotty area. Whatever we said about the 
comparison would have to be carefully limited so that readers would not interpret 
the statements of cognitively complex clauses as equivalent to cognitive demand 
in carrying out tasks. 

While we usually think of rates with units of "per 100" or "per 1000," it is also 
possible to have a rate "per 1." One type of rate frequently used in applied 
linguistics research is the number of words or of S nodes (sentence nodes) per 
T-unit. These are measures of syntactic complexity. Roughly, a T-unit can be 
defined as an independent clause with any attached dependent clauses. Thus the 
utterance Mari went to Chicago and then came to visit us would contain 2 T-units 
(Mari went to Chicago; (Mari) then came to visit us). The utterance Mari, who 
never loved flying to Chicago in winter, came to visit us would contain only 1 
T-unit. In this second utterance, the number of words per T-unit (13) is greater 
than the number of words per T-unit in the first (4, 5). In the first utterance, 
there is 1 S node for each T-unit (Mari went to Chicago: 1 T-unit and 1 S node; 
Mari came to visit us: 1 T-unit and 1 S node) and in the second, there are 2 S 
nodes (Mari came to visit us; Mari never loved flying to Chicago in winter) in 1 
T-unit. When learners produce many words per T-unit or S nodes per T-unit, 
we can (if we accept "words per T-unit" as a good operational definition of 
complexity) infer that the syntax is complex. 
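
Once an utterance has been divided into T-units, the words-per-T-unit rate is a simple average. The sketch below assumes the segmentation has already been done by hand (the code does not find T-units itself) and counts words by splitting on spaces; the function name `words_per_t_unit` is ours, introduced only for illustration.

```python
def words_per_t_unit(t_units):
    """Average number of words per T-unit; each T-unit is given as a string."""
    counts = [len(unit.split()) for unit in t_units]
    return sum(counts) / len(counts)

# The two utterances from the text, segmented by hand into T-units.
first = ["Mari went to Chicago", "then came to visit us"]
second = ["Mari, who never loved flying to Chicago in winter, came to visit us"]

print(words_per_t_unit(first))   # average over the 4-word and 5-word T-units
print(words_per_t_unit(second))  # the single 13-word T-unit
```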

Both rates and percents give us a way of comparing frequencies. Raw frequencies 
usually are not comparable because they are basically open-ended. By converting 
frequencies to rates or percents, we change them to closed-group frequencies 
which can be compared. 

Ratios are also useful in displaying information about frequencies in relation to 
each other. There are a number of ratios we might want to display for the 
elementary school study mentioned earlier. For example, we might want to know 
the ratio of boys to girls in LEP classification in the total school district. If there 
were 80 girls and 360 boys, the ratio would be: 

ratio = 80 / 360 = .22 

We could say the ratio of males to females is 1 male to .22 females. However, 
we don't usually talk about .22 of a person, so it is easier to multiply this by 100 
and say the ratio is 100 to 22. 

There are many other ratios we could display for the fourth-grade data. Since 
Goodlad's A Place Called School (1984) has shown that high school students 
have little opportunity to talk in school and since talk is important in language 
learning, we might want to know how the opportunities for talk vary across 
classroom tasks in the fourth-grade study. Some of the data collected are from 
teacher-centered activities. The teacher structures the interaction, calls on 
individual children for their responses, evaluates these responses, and poses more 
questions. She may also summarize the children's individual responses. In other 
tasks, the students are engaged in cooperative-learning activities. An example 
might be where children have to solve a problem and each child has some piece 
of information that others in the group do not possess. To solve the problem, 
each of these pieces of information must be presented and discussed. Kagan 
(1985) calls this the jigsaw technique. The teacher sets up the task but the 
students carry it out. We have the number of turns of talk for the teacher and for 
each of the individual students in teacher-centered and in cooperative-learning 
tasks. While we want to know how many turns children in each of the three 
language groups (EO, LEP, FEP) have in these two types of classroom activities, 
we are also interested in the number of turns in relation to the number of turns 
the teacher takes. Therefore, we could compute a ratio of teacher-to-student talk 
for each of the groups on each task. 

Ratios are also used in the analysis of questionnaire and test information where 
the likelihood of falling into one group or another is compared. For example, the 
number of Ss who pass or do not pass the French proficiency test might be 
presented as an odds ratio. If 350 passed and 125 did not, the ratio is 
350 ÷ 125 = 2.8, almost three to one. If travel in France is included in the 
demographic data, then the odds ratio for pass vs. not pass can also be calculated. 
Perhaps the ratio for passing the test would be 4 to 1 for the +travel group and 
1.5 to 1 for the -travel group. 
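
The two ratios worked out in this section (80 girls to 360 boys, and 350 passes to 125 non-passes) use the same division. A minimal sketch follows; the helper name `ratio` is hypothetical, and the figures are the ones given in the text.

```python
def ratio(a, b):
    """Ratio of a to b, expressed as a single number (a divided by b)."""
    if b == 0:
        raise ValueError("the second frequency must not be zero")
    return a / b

girls, boys = 80, 360
girls_per_boy = ratio(girls, boys)
print(round(girls_per_boy, 2))        # girls per 1 boy
print(round(girls_per_boy * 100))     # girls per 100 boys

passed, not_passed = 350, 125
print(round(ratio(passed, not_passed), 1))  # odds of passing vs. not passing
```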

Undoubtedly the most used ratio in applied linguistics research is the type-token 
ratio. This is a rough measure of vocabulary flexibility. While an overall word 
count can tell us about how much language a person can produce in some amount 
of time, the ratio of the number of unique words to the total number of words 
gives us a better notion of vocabulary use. The computer can easily count the 
total number of words and the number of unique words for each child. These 
ratios can then be compared across the three groups of children: EO, FEP, and 
LEP. 
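
A type-token ratio can be sketched in a few lines. This version makes simplifying assumptions that a real study would need to revisit: words are separated by spaces, capitalization is ignored, and the sample sentence is invented for illustration (it is not from the fourth-grade data).

```python
def type_token_ratio(text):
    """Number of unique words (types) divided by total words (tokens)."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text contains no words")
    return len(set(tokens)) / len(tokens)

# Invented sample: 11 tokens, 5 types (the, dog, saw, cat, and).
sample = "the dog saw the cat and the cat saw the dog"
print(round(type_token_ratio(sample), 2))
```

Because the ratio tends to fall as samples get longer, comparisons across children are usually made on samples of similar length.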


ooooooooooooooooooooooooooooooooooooo 

Practice 5.5 

1. What prediction would you make for the ratios in the teacher-centered 
instruction? Do you think the ratios will be equal for all three groups of children? 


How similar do you imagine the ratios of teacher-to-student talk for the three 
groups would be in cooperative learning? 


How similar do you predict the ratios would be between teacher-centered and 
cooperative-learning tasks? 


2. What predictions would you make about the type-token ratios for the children 
for oral and for written data? 


ooooooooooooooooooooooooooooooooooooo 

Coding Numerical Data 

In discussing simple computations used to tally and compare frequency data, we 
have used examples drawn almost exclusively from natural language data where 
these computations are difficult or at least time-consuming. With elicited data 
(for example, data from experiments or from tests), the process is much easier. The 
researcher can enter information on each student or subject on a data sheet and 
enter numbers for most responses. The data do not have to be transcribed and 
the coding is much easier. 

Suppose that teachers at the school where the fourth-grade research was being 
done attended an in-service workshop where the speaker talked about the 
difficulty many children experience in following consecutive directions. Given an 
applied linguistics perspective, you might suspect that it isn't just the number of 
steps in the directions but the order in which the steps are given that would affect 
comprehension. That is, you think a direction like: 

Before you draw the triangle, turn the page. 

might be less comprehensible than a direction: 

After you draw the triangle, turn the page. 

because the order of the clauses isn't the same as the order of required actions. 
You also think the connector itself may vary in difficulty. "Do X and then Y" 
(e.g., "Draw the triangle and then turn the page") might be easier than "Do X 
before Y" (e.g., "Draw the triangle before you turn the page"). Each child is 
tested individually on ability to follow directions where the order of mention is 
or is not the same as the order in which the actions are to be done, and where 
directions use connectors such as and or and then or but first compared with 
before or after. (You might want to complicate the study by looking at conditionals 
in directions too!) 

A data sheet could be prepared as follows. The data for each child would go on 
one row of the data sheet. At the top of each column place a label for the 
information that goes in that column. Columns 1-2 could hold the child's individual 
I.D. number (sometimes called a case number). The data will be easier to read if 
you leave an empty column between pieces of information (though this is not 
required). Columns 4-5 might contain the child's age and column 7 might show the 
child's sex. You could use a 1 for female and a 2 for male. Column 9 might be 
used for classification. You could arbitrarily assign a 1 for EO, a 2 for FEP, a 3 
for LEP and a 4 for NEP. 

If the test had had 10 items, you might use the following columns for that 
information. A 1 might indicate a correct response and a 0 an incorrect response. 

[Figure: Sample data sheet] 

Once the information is collected, it can be entered into the computer for tallying. 
With just a few instructions to the computer, you can learn how many boys and 
girls participated, the number of children in each classification, and the number 
of children who responded correctly or incorrectly to each item. Test items can 
be grouped and information on the number of total correct and incorrect 
responses for each group of items tallied. Of course, these tallies could easily be 
done by hand since the data are all right there on a single sheet of data paper. 
With large data sets, this is hard to do and of course the data are not easily 
accessible in natural language data bases. Once the data have been tallied, you will 
be able to give an informative display of the frequencies. The question that 
immediately springs to mind is whether or not the frequencies support your research 
hypotheses. This question (i.e., how big a difference do you need to be sure 
a difference is real?) will be discussed in the chapters of part III. 
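
The "few instructions to the computer" might look something like the sketch below. The four rows are hypothetical (they are not the data sheet from this chapter), and the field names are our own; only the coding scheme (1 = female, 2 = male; 1 = EO, 2 = FEP, 3 = LEP, 4 = NEP; 1 = correct, 0 = incorrect) follows the description above.

```python
from collections import Counter

# Hypothetical rows, one per child, coded as described in the text.
# sex: 1 = female, 2 = male; group: 1 = EO, 2 = FEP, 3 = LEP, 4 = NEP
# items: ten test items, 1 = correct, 0 = incorrect
data_sheet = [
    {"id": 1, "age": 9,  "sex": 1, "group": 1, "items": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]},
    {"id": 2, "age": 10, "sex": 2, "group": 3, "items": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]},
    {"id": 3, "age": 9,  "sex": 2, "group": 2, "items": [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]},
    {"id": 4, "age": 10, "sex": 1, "group": 3, "items": [0, 1, 1, 0, 1, 1, 0, 1, 0, 0]},
]

# How many girls and boys, and how many children in each classification.
sex_tally = Counter(row["sex"] for row in data_sheet)
group_tally = Counter(row["group"] for row in data_sheet)

# Number of children answering each item correctly (item 1 first).
item_correct = [sum(column) for column in zip(*(row["items"] for row in data_sheet))]

print(sex_tally, group_tally, item_correct)
```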

As you have undoubtedly realized, the coding and computations discussed in this 
chapter have been those used primarily for frequency data. In the next chapter 
we will discuss coding and displaying ranked (ordinal) and interval data. Before 
we do so, though, we will consider ways that the results of frequency 
computations might be displayed to make them more informative. 

ooooooooooooooooooooooooooooooooooooo 


Practice 5.6 

1. Using the data table, compute the frequency for number of male Ss. ____ 

Compute the frequency for number of male Ss who answered item 5 correctly. ____ 

What proportion of the Ss answered item 9 correctly? ____ What is the ratio 
of LEP Ss to the other groups combined? ____ 

2. Imagine for now that the hypotheses about the difficulty of clause types and 
clause connectors (page 145) turned out to be supported by the data. What 
suggestions might you make at the next in-service workshop? (Don't forget our 
learnings about research validity in previous chapters!) 


ooooooooooooooooooooooooooooooooooooo 

Data Displays 

It has been said that a picture is worth a thousand words. Sometimes the picture 
is used to represent numbers as well as words. 

With sophisticated computer programs, data displays are becoming more and 
more common. In the past it was expensive for journals to have draftsmen draw 
figures and graphs with India ink to obtain 'camera ready' copy of high quality. 
This problem is no longer an issue, so it is likely that data displays will more 
frequently be used to describe data. 


Bar Graphs and Histograms 

Bar graphs are often used to show frequencies. The bars in the graph are not 
attached to each other. In such graphs, there is no special reason for placing one 
bar to the right of another. For example, males could be ordered first or second 
in a display bar graph for sex of learners. Bar graphs showing frequencies for 
L1 membership are also not ordered in any special way. The graphs can be 
ordered as you wish. By convention, many researchers place the largest number in 
the center and build others out on either side, but this is not a rule by any means. 

When the bars in a bar graph are arranged in some meaningful order, they are 
often connected. These are sometimes called histograms rather than bar graphs. 








Here are demographic data reported by Reid (1987) in her survey of learning 
styles. Among many other facts, Reid reports on the length of time her Ss had 
studied English in the United States as follows: 

Time Studying English in U.S. 

Time                 n 
Less than 3 mos.   511 
3-6 mos.           266 
7-11 mos.          133 
12-17 mos.         131 
18 mos-2 yrs        48 
Over 2 yrs          13 
Over 3 yrs          53 


The histogram could be arranged either from high frequencies to low, from low 
to high, or in order of length of time. 



Data are also given on the educational status of the respondents. Class: 
Graduate, 424; Undergraduate, 851. The bar graph for this information might be 
displayed like this: 

[Figure: bar graph of Class (Graduate, Undergraduate); vertical axis runs to 1000] 
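
If no graphics program is at hand, a rough bar graph can be printed as text. The sketch below is only one way to do it: the function name `text_bar_graph` and the scale of one `#` per 50 respondents are our assumptions; the frequencies are Reid's figures as quoted above.

```python
def text_bar_graph(frequencies, scale=50):
    """Return one text line per category, with one '#' per `scale` cases."""
    width = max(len(label) for label in frequencies)
    lines = []
    for label, n in frequencies.items():
        bar = "#" * round(n / scale)
        lines.append(f"{label:<{width}} | {bar} {n}")
    return lines

status = {"Graduate": 424, "Undergraduate": 851}
for line in text_bar_graph(status):
    print(line)
```

Since the two categories have no inherent order, the dictionary could list either one first without changing what the graph says.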


ooooooooooooooooooooooooooooooooooooo 

Practice 5.7 

► 1. Draw a bar graph for Reid's report on the sex of her respondents. Sex: 
Male, 849; Female, 481. 


► 2. Draw a histogram for the data Reid gives on the age of her Ss. 

Age        n 
15-19    342 
20-24    532 
25-29    235 
30-34     87 
35-39     43 
40-44     16 
45-49      4 
50-54      3 
55+        1 

ooooooooooooooooooooooooooooooooooooo 


Pie Charts 


Some researchers feel that a pie chart is the best display for proportion or percent 
displays. Reid lists the major fields of study for her Ss as follows: 

Major                 n 
Engineering         268 
Business            277 
Humanities          171 
Computer science    130 
Hard sciences        54 
Medicine             43 
Other               420 


Here is a pie chart that illustrates this distribution. 

[Figure: pie chart titled "Major"; legend: Engr, Bus, Hum, CompSci, HardSci, Med, Other] 
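
Each slice of a pie chart is simply that category's proportion of the total. As a minimal sketch (the variable names are ours; the frequencies are Reid's as listed above), the percentages and the corresponding angles out of 360 degrees can be computed like this:

```python
majors = {
    "Engineering": 268,
    "Business": 277,
    "Humanities": 171,
    "Computer science": 130,
    "Hard sciences": 54,
    "Medicine": 43,
    "Other": 420,
}

total = sum(majors.values())

# Percent of all Ss in each major, and the size of each slice in degrees.
percents = {major: 100 * n / total for major, n in majors.items()}
degrees = {major: 360 * n / total for major, n in majors.items()}

for major in majors:
    print(f"{major:<17} {percents[major]:5.1f}%  {degrees[major]:6.1f} degrees")
```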

Practice 5.8 

► 1. Draw a pie chart that illustrates the proportion of male vs. female Ss for the 
data in Reid's report. 


2. Draw a pie chart that illustrates the breakdown for time studying English in 
the United States (data on page 148). 


ooooooooooooooooooooooooooooooooooooo 

Frequency Polygons and Line Drawings 

With bar graphs and histograms, like items are stacked like bricks on top of each 
other to form a bar. When polygons are used, each item is not visually present 
in the stack. Instead, all like items are tallied and the frequency point is marked 
with some symbol. The symbol could be a small bullet, •, a small square, ■, or 
whatever symbols your computer program may use. These symbols are then 
connected either with a straight line or with a curve. Some researchers do use 
straight lines to connect the points, but most people prefer a curving line. The 
shape of the distribution shown by this curved connecting line is called a 
polygon. Polygon is an important term to remember. In future chapters we will 
use it to talk about the visual shape of the data. That is, we won't say "the shape 
of the bar graph or histogram" but rather "the polygon." It's important to make 
this connection now so that later you can process it with ease. 

Polygons or line drawings are appropriate when frequencies are ordered in 
relation to each other. For example, if you want to display the ages of your Ss, it 
makes sense to begin with the youngest age at the left and the oldest at the right. 
If you want to display the scores of a large number of students, it seems sensible 
to give the frequency for the lowest score at the left and then arrange the 
frequencies of all the following scores to the right. 

Here is a frequency distribution of TOEFL scores as reported in Reid: 


TOEFL        n 
300-349      2 
350-399      9 
400-449     64 
450-474     74 
475-499     97 
500-524    120 
525-549    104 
550-574     73 
575+        63 


Here is the polygon that presents this information in a visual form: 

[Figure: frequency polygon; horizontal axis: TOEFL Scores] 

It is possible to have overlapping polygon displays. For example, assume that 
you wanted to show these TOEFL scores but you wanted the display to show the 
number of males and females at each point on the distribution. With a bar graph 
you could make two bars for each interval. With a polygon, you can select one 
symbol to represent the frequency of females and another for males and draw two 
lines in overlay. 

[Figure: overlapping frequency polygons for females and males; horizontal axis: TOEFL Scores] 

ooooooooooooooooooooooooooooooooooooo 

Practice 5.9 

► 1. Let's imagine that Reid collected the TOEFL scores of these same students 
one year later and that the frequencies (fictitious) were as follows: 

TOEFL        n 
300-349      0 
350-399      2 
400-449     58 
450-474     71 
475-499     98 
500-524    126 
525-549    110 
550-574     75 
575+        65 

Draw a frequency polygon for these hypothetical data and place it in overlay on 
that given for the original data. 

ooooooooooooooooooooooooooooooooooooo 


Conclusion 

In this chapter we have discussed ways in which the researcher can code, 
summarize and present information on frequency data. In the next chapter we will 
repeat this process using scaled and scored data. 


Activities 

1. G. Yan (1985. A contrastive textological analysis of restrictive relative clauses 
in Chinese and English written text. Unpublished master's thesis, TESL, UCLA.) 
compared the similarities and differences in the use and distribution of restrictive 
relative clauses in Chinese and English newspaper and journal articles, 
government agreements, and student compositions. Four major categories (and 20 
subtypes) identified by Celce-Murcia and Larsen-Freeman (1983) were used for 
Chinese and English clauses and four additional types were added that occur in 
Chinese but not English. One table in this thesis shows the frequency of clause 
types in the English and the Chinese data sources: 


English Type      f        % 
OS              300    66.96 
OO               69    15.40 
SS               44     9.82 
SO                5     1.11 
S/Adv.           29     6.47 
O/Adv.            1      .22 

Chinese Type      f        % 
OO              132    36.36 
OS              127    34.99 
SO               33     9.09 
SS               29     7.99 
TS                2      .55 
TO                1      .28 
O/Others         29     7.99 
S/Others         10     2.75 


Rearrange these tables to demonstrate cumulative frequency. While we have not 
identified all the relative clause types for you here, what can you say about the 
numerical distribution of restrictive relative clauses in the data based on this 
table? 

2. To compare the frequency of restrictive relative clauses in the different text 
types, a table is presented which gives relative clause frequencies in each text type 
and the total number of words in each corpus. 


                 f Rel Cl          No. Words 
Text Source     Eng     Ch        Eng        Ch 
Comps            92     53      4,700     6,486 
Newspapers       62     62      5,763     7,952 
Journals        192    116     14,411    19,876 
Agreements      102    132      6,299     8,741 


Compute the occurrence of restrictive relative clauses per 100 words. Would 
some other unit seem more appropriate to you? 

Prepare a bar graph to show the distribution of clauses in these data sources. 

3. B. Hayashi (1985. Language attrition: changes in the Japanese language over 
three generations. Unpublished master's thesis, TESL, UCLA.) investigated the 
Japanese used by first-, second-, and third-generation women in a family. Among 
several tables presented in the thesis is the following which looks at the degree of 
language mixing. 


Use of Japanese in Main and Subordinate Clauses 

                 1st Gen.       2nd Gen.       3rd Gen. 
Clause Type     Main   Sub     Main   Sub     Main   Sub 
Total             80   177       62   105       47    48 
Japanese          78   176       51    81       17    28 
% Japanese        98    99       82    77       36    58 


Which woman uses the most English mixing? Would you expect that the third- 
generation woman would use more main clauses than subordinate clauses in 
general in speaking Japanese? Why (not)? Are you surprised at the frequency 
of mixing in her main clauses as compared with that in subordinate clauses? Why 

4. K. Cruttenden (1986. A descriptive study of three ethnic Chinese schools. 
Unpublished master's thesis, TESL, UCLA.) gives demographic data on the 
families of students involved in three Chinese language programs. 


Ethnicity      Total   School 1   School 2   School 3 
Chinese           83         34         31         18 
Non-Chinese       15          4          3          8 


Compute the n for each school. What percent of the families is Chinese? Non- 
Chinese? What is the ratio of Chinese to Non-Chinese? What is the ratio of 
Chinese and Non-Chinese students enrolled in the programs at each of the three 
schools? 

From questionnaires, the education statistics for the parents were: 


Education    Total   School 1   School 2   School 3 
BA/BS           30         10         13          7 
MA/MS           32         24          4          4 
Ph.D.           21          5         13          3 
H.S.            20          5          1         14 


Note that the N is not the same as that for ethnicity. How do you explain this? 
What proportion of the parents report having attained a college (not high school) 
degree? Do you think this is a representative sample of parents in the county? 
Why (not)? 

The language or dialects of Chinese spoken by the parents include: 


Language         Total   School 1   School 2   School 3 
Mandarin            57         22         25         10 
Cantonese           24         11          8          7 
Taiwanese           23          9         11          3 
Other dialect       14          4          4          6 
None                13          5          2         11 


Calculate the percentages for each school separately. Do the percentages appear 
very similar across the three schools? Compute the cumulative frequency and 
cumulative percentage figures for the totals for language groups. 

Select any of the above statistics and prepare a pie chart to show the frequency 
distribution. 

5. J. Graham (according to an article by J. Sanchez, "The art of dealmaking," in 
the Los Angeles Times, 2/15/88, Sec. 4, p. 3) is a marketing professor at the 
University of Southern California who has studied the bargaining behaviors of 
businessmen from many different countries. Videotapes of bargaining sessions 
are reviewed for numerous characteristics of interpersonal communication. Here 
is a sample of some of the findings related to Japanese, Korean, Brazilian, 
German, British, and American businessmen. 

(a) The number of times "no" was used by each participant during 30 minutes of 
negotiation: Japanese 1.9, Korean 7.4, Brazilian 41.9, German 6.7, British 5.4, 
American 4.5. 
(b) The average number of minutes each participant looked at the partner's face 
per 10-minute period: Japanese 1.3, Korean 3.3, Brazilian 5.2, German 3.4, 
British 3.0, American 3.3. 
(c) The number of times "you" was used during 30 minutes: Japanese 31.5, 
Korean 34.2, Brazilian 90.4, German 39.7, British 54.8, American 54.1. 
(d) The number of overlaps (where both speak at once) in 30 minutes: Japanese 
12.6, Korean 44.0, Brazilian 28.6, German 41.6, British 10.5, American 10.3. 
(e) The average number of silent periods (over 10 seconds in length) between 
turns in 30 minutes: Japanese 5.5, Korean 0, Brazilian 0, German 0, British 5.0, 
American 3.5. 
(f) The average number of times each participant touched partner (excluding 
handshakes) during 30 minutes: Japanese 0, Korean 0, Brazilian 4.7, German 0, 
British 0, American 0. 

a. Prepare a graph to show all this information. 

b. Draw an overlay figure and/or a line polygon showing the information from 
the six histograms. 

c. Which display (graph or figure) do you feel is more informative and less 
misleading? Why? 

d. What linguistic factor (rather than bargaining behavior) do you think might 
account for the high frequency for "no" in the Brazilian data? 

e. If you were a consultant for a business firm, what other behaviors would you 
want to include in your research? 


Chapter 6 

Describing Interval and Ordinal Values 


• Measures of central tendency 
    Mode 
    Median 
    Mean 
    Central tendency and the normal distribution 
• Measures of variability 
    Range 
    Variance 
    Standard deviation 
    Standard deviation and the normal distribution 
• Ordinal measures 


In the previous chapter, we discussed some of the ways that frequency data can 
be displayed to give information at a glance. With ranking (ordinal) and interval 
measures, we are not concerned with "how many" or "how often" (as we are with 
frequency data). Rather, the data will tell us "how much," and the measure of 
how much is on an ordinal or interval scale. An example of ranked, or ordinal, 
data would be a teacher's rating of student performance on a five-point scale. 
A common example of interval data is scores obtained from a test. If 30 
students in a class take an exam, the data consist of their individual scores. It 
doesn't make sense to add the scores and display them as a total. Instead, we 
want to know the most typical score obtained by students in the class. That most 
typical value is called a measure of central tendency. 


Measures of Central Tendency 

The term central tendency is used to talk about the central point in the 
distribution of values in the data. We will consider three ways of computing central 
tendency, the most typical score for a data set. The choice among these three 
methods can be made by considering particular reservations that go with each 
method. The choice of one measure of central tendency over another is especially 
important when we want to compare the performance of different groups of 
students (or the changing performance of the same students over time). 

Mode 


Mode is the measure of central tendency which reports the most frequently 
obtained score in the data. 

Mode = most frequently obtained score 

Imagine that you have designed a program that you believe will give newly 
arrived immigrant students the vocabulary they will need to deal with daily life in 
the community. The program includes an assessment vocabulary test. Here are 
the scores of 30 students on the test: 


20  22  23  23  25  25  25  26  26  26 
26  26  27  27  27  28  28  29  29  30 
30  32  33  33  33  34  37  40  41  42 


Drawing a frequency polygon will show the mode most easily. For these data, 
we would draw a horizontal line and divide the line into 1-point increments from 
20 to 42, the low and high scores in the data set. Then, we would draw a vertical 
line at a right angle to the left side of this line and divide this line into 1-point 
increments from 0 to 6. The scores are shown on the horizontal line and the 
number of instances of each score is on the vertical axis. The lowest score is 20 
and there is only one such score. If you check the intersect of 20 and 1, you will 
find a plot symbol (you can use a square, a dot, a star, or whatever you like) at 
that point. The rest of the data are plotted in a similar way. 



The curved line connecting the symbols is the frequency polygon. 

If you drop a point from the peak of a frequency polygon to the baseline, the 
number on the baseline will be the mode. The mode, then, is really a frequency 
measure for interval data. It is a measure that shows us which score was the most 
typical for the most students (score 26 in this case). 

There are several problems with using the mode as a measure of central tendency. 
First, it is possible that there will be no one most frequent term in a distribution 
or none that receives a frequency higher than 1. Obviously, when this happens, 
the mode cannot be used. Mode is also the measure of central tendency which 
is most seriously limited because it is so easily affected by chance scores. Look 
back at the set of scores on the vocabulary test. Imagine that the data had been 
misscored. One student scored as a 26 really missed another item and so should 
have been a 25. The people with 32 and 34 scores should have received scores 
of 33. The mode is now 33. This change in central tendency is quite a shift! As 
the number of scores increases, the chances of such large shifts in the mode 
become less and less likely. 
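
The instability of the mode is easy to check by computer. The sketch below tallies the 30 vocabulary scores, reports the most frequent score, and then applies the three scoring corrections just described. The function name `modes` is our own; it returns a list because a distribution may have more than one most frequent score.

```python
from collections import Counter

scores = [20, 22, 23, 23, 25, 25, 25, 26, 26, 26, 26, 26, 27, 27, 27,
          28, 28, 29, 29, 30, 30, 32, 33, 33, 33, 34, 37, 40, 41, 42]

def modes(values):
    """Return every score that shares the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

print(modes(scores))            # mode of the original scores

# Apply the corrections: one 26 becomes 25; the 32 and the 34 become 33.
corrected = list(scores)
corrected[corrected.index(26)] = 25
corrected[corrected.index(32)] = 33
corrected[corrected.index(34)] = 33

print(modes(corrected))         # mode after rescoring three papers
```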


Median 

The median is the score which is at the center of the distribution. Half the scores 
are above the median and half are below it. If the number of scores is an odd 
number, then the median will be the middle score. If the number of scores is an 
even number, then the median is the midpoint between the two middle scores. 

Median = center of the distribution 

To find the median, then, the scores or observations are arranged from low to 
high and the middle score is obtained. This is the way the data were arranged 
for the vocabulary test. The median, the midpoint, for the values in the data set 
is 27.5. 
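
The steps for finding the median can be written out directly. This is a minimal sketch with a function name of our own choosing (`median_score`); Python's standard `statistics.median` gives the same result and is printed alongside as a check.

```python
import statistics

scores = [20, 22, 23, 23, 25, 25, 25, 26, 26, 26, 26, 26, 27, 27, 27,
          28, 28, 29, 29, 30, 30, 32, 33, 33, 33, 34, 37, 40, 41, 42]

def median_score(values):
    """Arrange the scores from low to high and return the center of the distribution."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                          # odd n: the middle score
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: midpoint of the two middle scores

print(median_score(scores))
print(statistics.median(scores))
```

With 30 scores, the two middle scores are the 15th (27) and the 16th (28).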

The median is often used as a measure of central tendency when the number of 
scores is relatively small, when the data have been obtained by rank-order 
measurement, or when a mean score is not appropriate. As you will discover in later 
chapters of this workbook, the median is an important measure of central 
tendency for certain statistical procedures. 


Mean 

The mean is the arithmetic average of all scores in a data set. Thus, it takes all 
scores into account. And so, it is sensitive to each new score in the data. Imagine 
the scores distributed as weights along a horizontal line, the plank of a seesaw. 
The scores are the weights on the plank. To make the seesaw balance, you have 
to move the plank back and forth until you get the exact balance point. That 
exact point is the mean. 
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You already know how to compute the mean because it is the same thing as the 
"average" score: add all the scores and divide by the number of scores. If we add 
the scores in the first data sample for the vocabulary test and divide the sum by 
the numbe: of scores, the answer is 29.1. 

Now, let's begin to use some of the convenient symbols of statistics. Σ is the
symbol which means to sum, or add. X is the symbol for an individual score or
observation. So ΣX is the instruction to add all the scores.

The symbol for the mean is X̄, pronounced "X-bar." (As you can imagine, this
meaning of X-bar was around long before it was applied to linguistic description.
X-bar in linguistics has nothing to do with X̄!) The formula for the mean is
"X-bar equals the sum of X divided by N":

    \bar{X} = \frac{\sum X}{N}

Although the mean is the most frequently used measure of central tendency, it 
too has a limitation. It is seriously sensitive to extreme scores, scores that clearly 
do not belong to the group of scores. As an example, imagine that two additional 
students took the vocabulary test. Unfortunately, the test is not appropriate for 
students who do not read roman script. These two additional students were 
tested even though they could not read the script. Their scores were 0. If we add 
these two values to the data set for the vocabulary test, the mean for the group 
is now 27.3. 
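
The drop from 29.1 to 27.3 can be checked with a short sketch. It assumes the first sample holds 30 scores; that count is an assumption here, chosen because it reproduces the 27.3 reported above. The function names are illustrative only.

```python
def mean(scores):
    """Arithmetic average: the sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

def mean_after_adding(old_mean, n, new_scores):
    """Recompute a mean when new scores join a group of n scores."""
    total = old_mean * n + sum(new_scores)    # recover the old sum, add the new scores
    return total / (n + len(new_scores))

# Assumption: 30 scores in the original sample, with the reported mean of 29.1.
updated = mean_after_adding(29.1, 30, [0, 0])
print(round(updated, 1))
```

Two scores out of 32 are enough to pull the mean down by almost two points, which is why extreme scores deserve a second look.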

Extreme scores can change the mean so drastically that it will not be the best
measure of central tendency. Extreme scores should be located and checked since
there is always the possibility of error in data entry (or in test administration).
When legitimate extreme scores are present, then the median may be a more
appropriate measure of central tendency. If the number of Ss is small, it is also
likely that the distribution will not be normal. For example, much of our research
is conducted in classrooms where the number of students may be below 30.
When there are few Ss and when they are in an intact group (that is, Ss are not
randomly selected) the data may not be normally distributed. The median may
be the more appropriate measure of central tendency.

When there are no extreme scores and the distribution of scores appears to be
normal, then the best measure of central tendency is the mean because it is the
measure which takes the magnitude of each score into account.


ooooooooooooooooooooooooooooooooooooo 

Practice 6.1 


1. Identify each symbol in the formula for the mean.

X̄ = _____

Σ = _____

X = _____

N = _____


2. In the vocabulary test example, imagine that the scores of two native speakers
were placed in the data by mistake. Each received a perfect score of 50. How
would these two scores affect the mode, the median, and the mean? _____


►3. The following table shows the distribution of reading speed scores for the
30 students who took the vocabulary test. One S's test was illegible so only 29
scores appear below. First scan the data for the mode on reading speed.

Reading Speed

Words/min.    f
300           3
280           0
260           0
240           2
220           1
200           3
180           4
160           8
140           4
120           3
100           0
80            1
60            0


When there are not a large number of values, it is easy to look at the distribution
first for the mode. This gives a quick reading on the data. Mode = _____.

What is the median for the above data? _____.

Now, compute the mean score. _____. (Remember that you can't simply
sum the wpm, but must multiply by f for each wpm score.)

Why are the mean, the mode, and the median different?


4. If you visualize the scores as weights distributed on the plank of a seesaw, 
which measure gives you the best balance point for central tendency? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Central Tendency and the Normal Distribution

If there are no very extreme scores and if you have 30 or more observations, you
may have a normal distribution. A normal distribution means that most of the
scores cluster around the midpoint of the distribution, and the number of scores
gradually decreases on either side of the midpoint. The resulting polygon is a
bell-shaped curve.



A normal distribution is a theoretical mathematical concept. The distribution of 
data is normal to the degree that it approaches this bell-shaped curve. Notice 
that the tails, the tail ends of the curve, show where the extreme values occur. 
These are the values that occur least often in the data. When the majority of 
scores fall at the central point and the others are distributed symmetrically on
either side of the midpoint in gradually decreasing frequency, the distribution is
normal, and all three measures of central tendency (mode, median, and mean)
will be the same.

However, the distribution of data is not always normal: that is, the frequency
polygon may not slope away from the central point in a symmetric way and the
tails will not look the same. Let's consider what this means in terms of the
measures of central tendency. Each of the polygons below differs from the normal
distribution.

[Figure: three frequency polygons. Polygon 1: positive skew. Polygon 2: negative
skew. Polygon 3: bimodal.]


Let's suppose that you gave the students in your class an examination. The
scores were: 24, 25, 25, 25, 27, 27, 27, 27, 28, 28, 28, 29, 29, 29, 30, 30, 34, 36,
38, 39, 41, 45. Look at the three polygons. Which best illustrates these data?
Polygon 1, right? Since some few scores are much higher (and there are no
matching, much lower scores), the mean will be affected. It will be higher than
it would be if the distribution were normal. Most of the scores cluster around a
central point that is pulled higher than it normally would be. This is because a
few extreme scores under the right tail of the distribution pull the mean to the
right. This type of distribution is said to be skewed. Since the mean is moved to
the right by the right-hand extreme scores, we say that it is positively skewed.

Now imagine that the scores on the exam were 16, 17, 18, 24, 25, 25, 25, 27, 27,
27, 28, 28, 28, 29, 29, 29, 30. The mean is lower than it would be in a normal
distribution. Again, the distribution is skewed but this time in a negative
direction (visually towards the left of the distribution) as shown in the second
polygon. This is a negatively skewed distribution.
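
A short sketch can confirm the direction of skew for the two sets of exam scores given above. Comparing the mean with the median is used here only as a rough rule of thumb (an assumption of this sketch, not a formal test of skewness): extreme scores pull the mean, while the median stays near the bulk of the scores.

```python
from statistics import mean, median

# The two sets of exam scores from the text.
positive = [24, 25, 25, 25, 27, 27, 27, 27, 28, 28, 28, 29, 29, 29,
            30, 30, 34, 36, 38, 39, 41, 45]
negative = [16, 17, 18, 24, 25, 25, 25, 27, 27, 27, 28, 28, 28, 29,
            29, 29, 30]

def skew_direction(scores):
    """Rough indicator only: compare the mean with the median."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positive"     # mean pulled toward high extreme scores
    if m < md:
        return "negative"     # mean pulled toward low extreme scores
    return "symmetric"

print(skew_direction(positive), mean(positive), median(positive))
print(skew_direction(negative), round(mean(negative), 2), median(negative))
```

For the first set the mean (30.5) lies above the median (28.5); for the second the mean (about 25.4) lies below the median (27).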

Skewed distributions are asymmetric. They show that extreme scores have pulled 
the point of central tendency in either a negative or positive direction. Look back 
at the skewed distributions. In the negatively skewed distribution, some students 
performed much worse than we might expect them to. Their data pulled the 
mean lower. In the positively skewed distribution, some students performed
much better than we might have expected. Their data pulled the mean higher. 
(Because of the visual form of the polygon, people often reverse the meanings of 
positive and negative skew in discussing distribution.) 

In either case, the data are not normally distributed and the mean may not be 
an accurate measure of central tendency. When this happens, it affects the kinds 
of statistical tests we can use. Many statistical tests, as you will see later, assume
the data are normally distributed, that is, that X̄ is the best measure of central
tendency for the data.

The final polygon shows two peaks, two different modes. So, logically enough,
it is called a bimodal distribution. There are two peaks in the distribution and the
data spread out from each.

A bimodal distribution suggests that there are two very different groups of Ss
taking this test: one group that scores rather low and one that scores higher. It
might lead you to suspect that those that scored low had not yet taken the course.
Another possibility is that they entered the course late. Another is that those
with scores under the higher mode point are from languages which have many
cognate words for the vocabulary tested. They may have had an advantage over
other students. Or, possibly the students at the higher end of the distribution
were those who had been living in the community for some time prior to enrolling
in the course. As you may guess from this example, bimodal distributions are
very important in research. They show us that some important independent
variable with two levels may have been missed, an independent variable which
has an effect on the dependent variable. For example, if you gave a Spanish test
to a group of students and obtained a distribution like this, you might wonder if
the data that cluster around the right mode might come from Ss who had an
opportunity to speak Spanish outside class while the data clustered around the
left mode might be from those Ss who had no such opportunities.

Looking at a distribution curve, it is also possible to see whether the data are
"flat" or "peaked." Flatness or peakedness of the curve is called kurtosis. If the
data spread out in a very flat curve (a platykurtic distribution) or in a very sharp
peaked curve (a leptokurtic distribution), the distribution departs from the
normal curve.
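
Flatness and peakedness can also be expressed as a number. The sketch below uses one common definition, the population excess kurtosis (the fourth central moment divided by the squared second central moment, minus 3). That formula is an assumption brought in for illustration and is not given in this chapter; both data sets are hypothetical.

```python
def excess_kurtosis(scores):
    """Population excess kurtosis: m4 / m2**2 - 3 (about 0 for a normal curve)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in scores) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]     # hypothetical: spread evenly across the range
peaked = [1, 5, 5, 5, 5, 5, 5, 5, 5, 9]    # hypothetical: piled up at the center

print(excess_kurtosis(flat))     # negative: flatter than normal (platykurtic)
print(excess_kurtosis(peaked))   # positive: more peaked than normal (leptokurtic)
```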




[Figure: distribution curves illustrating kurtosis; the surviving label is "platykurtic"]


Information on skew and the number and shape of the peaks in the curve helps
us to decide whether the distribution is normal. When the distribution is not
normal, the three measures of the most typical score will vary. The decision as
to which is the best measure of central tendency depends on this information, on
the presence or absence of extreme scores, and on the purpose of the study. For
example, imagine you wanted to offer a vocabulary class in your private language
school. Only one class can be offered. If you had data from a vocabulary test
for prospective students, the most useful measure of central tendency would be
the mode. It would show you the level achieved by the most applicants. If many
prospective students scored at the mode or close to it, you should have enough
students to make the class financially viable for the school. You could also pitch
the course (and your ads) at the appropriate level for the students. You can
forget about the people who scored much lower or much higher than those at the
mode.

Sometimes the median is the best measure of central tendency. If the distribution 
of responses in the data includes a number of scores that skew the distribution, 
then the median is the best option. As we have seen, the median is much less 
susceptible to extreme scores than the mean or the mode. 

The mean is the measure of central tendency most frequently used when we want
to compare performance or responses of different groups. When we talk about
an entering class as having an average TOEFL score of 523 while last year's class
had a 580, the numbers are assumed to be means of the two groups. Since we
so frequently use the mean in comparing groups, it is important to know whether,
in fact, the distribution of the scores was normal (i.e., whether or not the mean
is, in fact, the best measure of central tendency). To do that, we need to consider
variability in the data.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 6.2 

►1. Imagine that you received the following data on the vocabulary test
mentioned earlier (see page 160):

20  22  23  23  23  23  23  23  24  25
28  29  30  30  30  30  30  30  31  32
32  33  33  34  35  35  36  36  37  37


Chart the data and draw the frequency polygon.

2. Compute the measure of central tendency which you believe to be best for the
data. Justify your choice. Compare your choice with that of other members of
your study group. What consensus did you reach?


3. Think of a test that you gave (or took) recently. If the distribution turned out
to be bimodal, what hypotheses might you make about it?


ooooooooooooooooooooooooooooooooooooo 


Measures of Variability 

The measure of central tendency tells us the most typical score for a data set. 
This is important information. However, just as important is the variability of 
scores within the data set. Suppose that you gave the vocabulary test to three 
different groups of students. Each group achieved a mean score of 30. Does this 
mean that the performance of the groups was the same? No, of course it doesn't. 
The variability of the scores, how they spread out from the point of central 
tendency, could be quite different. 

Compare the frequency polygons for these three different groups. 



As you can see, the distribution of scores in the three groups is quite different
while the mean for each is the same. The first and second curves are symmetric
while the third is skewed. If we compare the first with the third, the third is
negatively skewed by the scores under the left tail of the distribution. While the
first and second curves are alike in being symmetric, the distribution of scores in
the second is almost flat: the scores spread out almost equally across the range.
In the first curve, though, most of the scores in the distribution cluster around the
mean.


In order to describe the distribution of interval data, the measure of central
tendency will not suffice. This is particularly true, as we have just seen, when
we wish to compare the performance of different groups. To describe the data
more accurately, we have to measure the degree of variability of the data from
the measure of central tendency.

Just as we have three ways of talking about the most typical value or score in the
data, there are three ways to show how the data are spread out from that point.
These are called range, variance, and standard deviation.


The easiest way to talk about the spread of scores from the central point is range.
For example, the average age of students in a particular adult school ESL class
is 19. The youngest student is 17 and the oldest is 42. To compute the range,
subtract the lowest value from the highest.

    Range = X_{highest} - X_{lowest}
    Range = 42 - 17 = 25

The age range in this class is 25.

Imagine the distribution plank for this age data. Nineteen, the mean, is the
balance point on the seesaw. The lowest "weight" on the seesaw is 17. The highest
is 42. The balance point is much closer to 17 than 42, so we know many more
Ss group at the lower end of the range. We can predict, then, that this is not a
normal distribution but one that is positively skewed. The mean is higher than
it would be if it were not for the extreme score of 42. Range is a useful, informal
measure of variability. However, it changes drastically with the magnitude of
extreme scores. If you had test score data where one person simply didn't do the
test (scored zero), the range would dramatically change just because of that one
score. Since it is an unstable measure, it is rarely used for statistical analyses.
Yet, it is a useful, first measure of variability.

Because range is so unstable, some researchers prefer to stabilize it by using the
semi-interquartile range instead. The semi-interquartile range (SIQR) is half the
range covered by the middle 50% of the scores.

The formula for the SIQR is:

    SIQR = \frac{Q_3 - Q_1}{2}

Q3 is the score at the 75th percentile and Q1 is the score at the 25th percentile.
For example, imagine that for the TOEFL test, the score at the 75th percentile
is 560 and 470 is the score at the 25th percentile. The SIQR would be:

    SIQR = \frac{560 - 470}{2} = 45

One nice thing about the SIQR is that it is much less affected by extreme scores
than the range is. Moreover, it can be used in skewed distributions where there
are extremely high or low scores since the extreme scores are not considered in
calculating the SIQR. In a skewed distribution, the median can be used as a
measure of central tendency and the SIQR can be used as a measure of
variability. Of course, the disadvantage of the SIQR is that percentile scores must be
available or calculated.
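
Both measures can be sketched in a few lines of Python. The TOEFL quartiles and the ages are the figures from the text. The helper `siqr_from_scores` is illustrative only: it relies on `statistics.quantiles`, whose default method is just one of several conventions for locating percentiles, so other software may give slightly different quartiles for the same data.

```python
from statistics import quantiles

def value_range(scores):
    """Range: the highest value minus the lowest value."""
    return max(scores) - min(scores)

def siqr(q1, q3):
    """Semi-interquartile range from the 25th and 75th percentile scores."""
    return (q3 - q1) / 2

def siqr_from_scores(scores):
    """SIQR from raw scores; the quartile cut points follow Python's default method."""
    q1, _, q3 = quantiles(scores, n=4)
    return siqr(q1, q3)

toefl_siqr = siqr(470, 560)          # the TOEFL example from the text
age_range = value_range([17, 42])    # youngest and oldest student in the text
print(toefl_siqr, age_range)
```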

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 6.3

1. Look back at the first data set for the vocabulary test (page 160). The mean
for the data was _____. The range is _____. Again imagine the
distribution of the data with each score as a weight on a plank and the mean as the
balance point. Does this picture appear to be a normal distribution? Why (not)?


2. Imagine that you are evaluating a special English reading program for a
national Ministry of Education. On what grounds would you argue for and against
using the median and the SIQR as the measure of variability (thus deleting data
above the 75th percentile and below the 25th percentile)? _____


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Variance 

Suppose you were teaching an introductory linguistics class and when you
returned a set of midterm exams, you announced that the mean score on the exam
was 93.5. You can be sure that all students immediately check to see how close
their scores are to the average for the test. If a student scores 89, the score is 4.5
points from the mean. This is the deviation of one score from the mean.

Students are interested in how much better (or worse) they perform than the
average for the class. For research, however, we want to know more than just one
individual's placement relative to the mean. We want a measure that takes the
distribution of all scores into account. One such measure is variance.

To compute variance, we begin with the deviation of the individual scores from
the mean. A lowercase, italicized x is used to symbolize the deviation of an
individual score from the mean.

If we added all the individual deviations of scores from the mean for the midterm
exam, we would have the total for variability, Σx. However, we know the total
for variability is, in part, a reflection of the number of observations or Ss' scores
in the data. So, we need to find an average variability for the distribution.

The following table shows the scores for the midterm exam. If you compute the
X̄, you will find that it is 93.5.


Midterm Exam

   X    x = (X - X̄)       x²        X    x = (X - X̄)       x²
 100        6.5         42.25      85       -8.5         72.25
  88       -5.5         30.25      82      -11.5        132.25
  83      -10.5        110.25      96        2.5          6.25
 105       11.5        132.25     107       13.5        182.25
  78      -15.5        240.25     102        8.5         72.25
  98        4.5         20.25     113       19.5        380.25
 126       32.5       1056.25      94        0.5          0.25
  85       -8.5         72.25     119       25.5        650.25
  67      -26.5        702.25      91       -2.5          6.25
  88       -5.5         30.25     100        6.5         42.25
  88       -5.5         30.25      72      -21.5        462.25
  77      -16.5        272.25      88       -5.5         30.25
 114       20.5        420.25      85       -8.5         72.25

In the second column are the individual difference values (x). Each value is the
difference between the individual score and the mean (X - X̄). This was the first
step in the computation. If you sum these individual deviation scores, the result
may surprise you. Σ(X - X̄) = 0. Does the result surprise you? Remember that
the mean is the balance point on the seesaw of the total distribution. If you add
all the minus weights on one side of the seesaw and then all the plus weights on
the other side, you should get zero because they balance each other out.

We have already said that we want an average of all the individual deviations
from the mean. Obviously, adding them all and dividing them by the number
of scores or observations won't work if the total is zero. To solve this dilemma,
we square the deviation of each individual score from the mean (these are shown
in the column labeled x²) and add these. This total, sometimes called the sum of
squares, shows the total variability in the data set.

Our next step is to find an average for this total variability figure, Σx². That is,
we don't want the measure of variability to increase just because there are lots
of scores in the distribution. When we average a total, we usually divide by the
number of cases (the N). It would be perfectly legitimate to do this, if we had a
large N (over 100 scores).

However, with a small sample, mathematicians have determined that it is more
accurate to divide the total by N - 1, since N - 1 produces an unbiased estimate
of the variance. If we divide by N - 1, the result will be:

    \frac{\sum x^2}{N - 1} = \frac{5268.5}{25} = 210.74

The formula for variance, then, is the following.

    variance = \frac{\sum x^2}{N - 1} = \frac{\sum (X - \bar{X})^2}{N - 1}

Let's summarize the steps shown in this formula once again. 


1. Compute the mean: X̄.

2. Subtract the mean from each score to obtain the individual deviation scores:
x = X - X̄.

3. Square each individual deviation and add: Σx².

4. Divide by N - 1: Σx² / (N - 1).
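
The four steps can be followed line by line in a short Python sketch, using the 26 midterm exam scores from the table above. The function name is illustrative only.

```python
# The 26 midterm exam scores from the table in the text.
midterm = [100, 85, 88, 82, 83, 96, 105, 107, 78, 102, 98, 113, 126,
           94, 85, 119, 67, 91, 88, 100, 88, 72, 77, 88, 114, 85]

def variance(scores):
    """Sample variance, following the four steps in the text."""
    n = len(scores)
    mean = sum(scores) / n                             # step 1: compute the mean
    deviations = [x - mean for x in scores]            # step 2: x = X - mean
    sum_of_squares = sum(d ** 2 for d in deviations)   # step 3: square and add
    return sum_of_squares / (n - 1)                    # step 4: divide by N - 1

midterm_variance = variance(midterm)
print(midterm_variance)
```

The deviations themselves sum to zero, which is why step 3 squares them before adding.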


ooooooooooooooooooooooooooooooooooooo 

Practice 6.4 

►1. Prospective teachers can receive 20 points on an instrument developed to
screen urban school teachers. Here are the scores of 10 prospective teachers.

 S    Score    X - X̄    x²
 1     16
 2     13
 3     13
 4     19
 5     18
 6     15
 7     20
 8     11
 9     14
10     15


Fill in the values in the chart.

2. As a review of symbols, supply their values in terms of the example data.


►3. What is the variance shown in the data for practice item 1 above?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Standard Deviation

If you understand the concept of variance, then you already understand standard
deviation. The two measures are very similar. Variance is used in many
statistical procedures, but standard deviation is more often reported in research articles
and so the term may be more familiar to you. Both measures of variability
attempt to do the same thing: give us a measure that shows us how much
variability there is in scores.

While range simply looks at the scores at each end of a distribution, variance and
standard deviation begin by calculating the distance of every individual score
from the mean. Thus, they take every score into account. Because individual
scores will be either below or above the mean and the mean is the balance point,
the deviation of each score must be squared. Once squared, these deviations are
totaled. To get an average for the deviations, the total is then divided by N - 1.
Up to this point the two measures are the same. Standard deviation goes one step
further. Since we began by squaring the differences of each score from the mean,
we now do the reverse. We change it by taking the square root of the variance.

You may see different directions, different ways of representing formulas. Can
you see that the following directions say the same thing? (s, here, stands for
standard deviation. You may also see it abbreviated as s.d. The symbol s is used
in formulas but many researchers use s.d. as a label in tables and charts.)

    s = \sqrt{variance}

    s = \sqrt{\frac{\sum x^2}{N - 1}}

    s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}

If you have a problem with this, talk through it in your discussion group.

In order to clarify the concept of standard deviation, we have asked you to begin
by subtracting the mean from each individual score. There is, however, a much
easier way that uses raw scores instead. The formula is:

    s = \sqrt{\frac{\sum X^2 - [(\sum X)^2 \div N]}{N - 1}}

Let's read through the formula. First it says to square each of the scores. Then
the formula asks that we sum these. Let's do this with the midterm exam data:

Midterm Exam

   X       X²        X       X²
 100    10000       85     7225
  88     7744       82     6724
  83     6889       96     9216
 105    11025      107    11449
  78     6084      102    10404
  98     9604      113    12769
 126    15876       94     8836
  85     7225      119    14161
  67     4489       91     8281
  88     7744      100    10000
  88     7744       72     5184
  77     5929       88     7744
 114    12996       85     7225


The total for X, or ΣX, is 2431, and we must square this value according to the
formula (5909761). Then, the ΣX² is 232567. We can place these values in the
formula. If you are confused, remember that ΣX² means to first square and then
add. (ΣX)² means that you first sum and then square the total.

    s = \sqrt{\frac{\sum X^2 - [(\sum X)^2 \div N]}{N - 1}}

    s = \sqrt{\frac{232567 - (5909761 \div 26)}{25}}

    s = 14.52

If you work with a hand calculator, the formula that uses raw scores will save you
time. If you have entered the data into the computer, the computer will, of
course, do all this calculation for you.

What does the actual standard deviation figure tell us? We have said that it is a
measure of variability of the data from the point of central tendency. While we
will elaborate much more on the concept of standard deviation later, the
important thing to realize now is that the larger the standard deviation figure, the wider
the range of distribution away from the measure of central tendency. The data
are more widely scattered. The smaller the standard deviation figure, the more
similar the scores, and the more tightly clustered the data are around the mean.
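
As a check, the deviation formula and the raw-score formula can be run side by side on the midterm exam data. This is a minimal sketch; the function names are illustrative only.

```python
from math import sqrt

# The 26 midterm exam scores from the text.
midterm = [100, 85, 88, 82, 83, 96, 105, 107, 78, 102, 98, 113, 126,
           94, 85, 119, 67, 91, 88, 100, 88, 72, 77, 88, 114, 85]

def sd_from_deviations(scores):
    """s = sqrt(sum((X - mean)^2) / (N - 1))"""
    n = len(scores)
    mean = sum(scores) / n
    return sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

def sd_from_raw_scores(scores):
    """s = sqrt((sum(X^2) - (sum(X))^2 / N) / (N - 1))"""
    n = len(scores)
    sum_x = sum(scores)                           # sum first; squared below
    sum_x_squared = sum(x ** 2 for x in scores)   # square each score, then sum
    return sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

print(round(sd_from_deviations(midterm), 2))
print(round(sd_from_raw_scores(midterm), 2))
```

Both routes give the same figure, since the raw-score formula is an algebraic rearrangement of the deviation formula.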



If you have been successful in visualizing a seesaw with data spread out on a
plank and the mean as the balance point, the following visuals may be more
helpful. Think of the standard deviation as a ruler that measures how scattered
out on the plank the data actually are. When the data are widely scattered the
standard deviation is large and the ruler is long. When the data are tightly
clustered around the balance point of the seesaw, then the standard deviation "ruler"
is short.


[Figure: the standard deviation (s) drawn as a ruler along a scale running from 2 to 12]



How might this information on standard deviation help you? Imagine that you
work in a language school where students are placed on the basis of an entrance
exam. You will teach one advanced class and one low-intermediate class. There
are three sections of each class; you can select your sections. (All three sections
meet at the same time so convenience as far as time is concerned is not an issue!)
You are given first choice. The director presents you with the following
information:


Placement Exam Scores

                    Section     X̄       s
Advanced               1       82.4    24.1
                       2       81.0     4.5
                       3       80.9     8.2
Low Intermediate       1       30.1     1.1
                       2       29.4    12.4
                       3       25.8     8.7


As you can see, the sections are arranged according to the mean score for the
section. Look first at the difference in the mean scores for the three advanced
classes. The means are very close together. Now look at the standard deviation
values for the three advanced classes. The scores for section 2 are all close
together. This class is the most homogeneous. Section 1, on the other hand, has
much greater variability. If you are an advocate of cooperative learning, you
might want a class where students are at different levels of ability. In that case,
you'd probably select section 1. If you like to work with classes where everyone
is at about the same level, you might decide between sections 2 and 3.
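
The same comparison can be written as a small lookup over the placement table. This is a sketch only: the dictionary layout and function names are illustrative, while the means and standard deviations are the ones given in the table.

```python
# (mean, standard deviation) for each section, copied from the placement table.
advanced = {1: (82.4, 24.1), 2: (81.0, 4.5), 3: (80.9, 8.2)}
low_intermediate = {1: (30.1, 1.1), 2: (29.4, 12.4), 3: (25.8, 8.7)}

def most_homogeneous(sections):
    """Section number with the smallest standard deviation."""
    return min(sections, key=lambda sec: sections[sec][1])

def most_varied(sections):
    """Section number with the largest standard deviation."""
    return max(sections, key=lambda sec: sections[sec][1])

print(most_homogeneous(advanced), most_varied(advanced))
print(most_homogeneous(low_intermediate), most_varied(low_intermediate))
```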

We can make more informed decisions if we consider not just the mean (or
median or mode) but also the standard deviation. The standard deviation gives us
information which the mean alone cannot give. It is perhaps even more
important than the measure of central tendency.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 6.5 

1. In the above example, which two classes would you select? Why? _____


►2. For the example given in the last practice exercise (page 172), compute the
standard deviation using first the formula s = √variance. s = _____.

Now, recompute the standard deviation using the raw score formula. Remember
you will have to calculate X², the square of each score, and ΣX², the sum of these
squared scores. To review the meaning of each symbol, fill in the values below.
Then place these in the formula for the standard deviation and show the results
below.

ooooooooooooooooooooooooooooooooooooo

Standard Deviation and the Normal Distribution 

Given that we already know that the larger the standard deviation, the wider the
spread of scores away from the mean, what else can standard deviation tell us?
If the distribution of the scores is normal, the standard deviation can give us a
great deal of information. We already know that in a normal distribution, the
mean, mode, and median are the same. The mean score then is equivalent to the
median. When that is so, half the scores will be higher than the mean and half
will be lower. Not only can we say that roughly 50% of the scores will fall above
and 50% below the mean, but we can approximate the percent of scores that will
fall between the mean and 1 standard deviation above or below the mean.

If the distribution is normal, 34% of the scores will fall between the mean and 1
standard deviation above the mean. Standard deviation works like a ruler
measuring off the distance from the mean for 34% of the data in normal
distributions. The same is true for 1 standard deviation below the mean. 68% of all
the scores will occur between 1 standard deviation below and 1 standard
deviation above the mean. Notice that we have not specified the numerical value of
the mean in this statement. Nor have we specified the numerical value of the
standard deviation. In a normal distribution, the most typical score is specified
as the mean. Once we know the value of the standard deviation for the data,
we have a "ruler" for determining the distance from the mean of any proportion
of the total data. Look at the following bell-shaped distribution curve.



50% (34 + 13 + 3) of the scores fall between the mean and 3 standard
deviations above the mean. 50% fall between the mean and 3 standard deviations
below the mean. 68% (34 + 34) of the scores fall between 1 standard deviation
below and 1 standard deviation above the mean. 94% (34 + 34 + 13 + 13)
of the scores are within 2 standard deviations of the mean. That leaves 3%
under each tail between 2 and 3 standard deviations above and below the mean.
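
The percentages above are rounded to whole numbers. A minimal sketch using Python's `statistics.NormalDist` gives the more precise proportions for a theoretical normal curve; the function name `share_within` is illustrative only.

```python
from statistics import NormalDist

# A standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)   # about 0.68
within_2 = share_within(2)   # about 0.95
within_3 = share_within(3)   # about 0.997

print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
```

Because 34, 13, and 3 are each rounded, their sums land slightly below the precise figures: about 95% of scores lie within 2 standard deviations, and a small fraction (under half a percent) lies beyond 3.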

In succeeding chapters, we will see how useful this information can be for
research where we want to compare the performance of groups (or the changing
performance of the same group over time). For now, you should realize that
means of groups may be very similar and yet the groups, in fact, may be quite
different because the distribution of scores away from the mean may be quite
different. In some cases, the standard deviation may be very large because the
data are spread out in the distribution. In other cases, the standard deviation
may be quite small because the scores cluster around the mean. If we give a class
a pretest, teach the content of the test and then give a posttest, we expect two
things will happen. First, the X̄ will be higher on the posttest, and second, the
standard deviation will be much smaller. Sometimes, when the material to be
learned is very difficult, large gains in the X̄ may not occur but we would expect
that instruction would have the effect of making students perform more similarly.
The standard deviation should, therefore, be smaller.


Ordinal Measures 

In chapter 5 we talked about ways frequency data might be summarized and
displayed. In this chapter, we have presented the ways in which interval data are
typically summarized. This leaves a gap regarding rank order, or ordinal,
measurement. And it is precisely here that we believe (others may or may not agree)
that it is important to be clear about just how variables are measured.

In many statistics books, the measurement of variables is treated as either discrete
or continuous. Discrete refers to measurement of a nominal (categorical) variable
which by its nature either is or is not present. There is no measure of "how
much," for the choice is either 100% or zero. Such variables are tallied in
frequency counts. Ordinal and interval data are then grouped together as continuous
data as though they were the same thing. That is, ordinal measurement where
variables are rated or ranked in relation to each other (and thus tell us "how
much more") is seen as effectively the same as interval measurement where "how
much" is measured on an equal-interval basis.

While it is true that in some instances ordinal and interval measurement may be 
very similar, in other cases they are not.

To begin to understand this issue, consider the following. Prospective ESL
teachers in Egypt, studied by El-Naggar and Heasley (1987), were asked to
respond to items regarding the value of microteaching (i.e., videotaped peer
teaching where the performing teacher-trainee reviews the tape with peers and
supervisor). The items were statements, and each student responded according to
how strongly they agreed with the statement. For example:

I prefer to receive feedback from my supervisor rather than from my peers.

1 2 3 4 5 

The numbers show 1 = strongly disagree, 2 = disagree, 3 = neither agree nor
disagree, 4 = agree, 5 = strongly agree.

The question is whether the scale is equal-interval. How large an interval is there 
between agree and strongly agree? Between neither agree nor disagree and
agree? Are these intervals the same? Can we use a mean score as a measure of
central tendency for such data? 

If we believe the intervals are equal, we should be able to compute a mean and 
standard deviation for the responses. If not, then it makes no sense to compute 
these measures. 

Assume you were the teacher in this course and you gave the students the
questionnaire before you assigned them their grades. How spread out do you imagine
their ratings might be (i.e., what would you guess the range to be)? (If it is small,
then we need to question whether this is truly a five-point scale at all.) If we
think the scale is like interval data, then we would think responses to such
questions can properly be summed to give a single numerical value that reflects
each individual's attitude toward microteaching.

If you believe that the scale is really continuous or equal interval and that the 
distribution approaches the normal distribution, then it is quite proper to go 
ahead and use the mean and standard deviation in describing the data. If you 
do not believe this is the case, then you would more likely discuss each question 
separately, reporting on the number of teachers who selected a rating of 1 or 2 
or 3 or 4 or 5, treating the data in terms of frequency. Thus, you could use the
mode to show the most popular choice for the teachers on each item, or you could
report the proportion of teachers who selected each point on the scale.
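
Treating a single questionnaire item as frequency data might look like the following sketch. The 20 responses are hypothetical, invented for illustration; only the 1 to 5 agreement scale comes from the text.

```python
from collections import Counter

# Hypothetical responses (1-5) from 20 trainees to one item; invented data.
responses = [4, 5, 4, 3, 5, 4, 2, 4, 5, 3, 4, 4, 1, 5, 4, 3, 4, 2, 5, 4]

counts = Counter(responses)
n = len(responses)

# Frequency and proportion of trainees choosing each point on the scale.
for rating in range(1, 6):
    proportion = counts[rating] / n
    print(f"rating {rating}: f = {counts[rating]:2d}, proportion = {proportion:.2f}")

# The mode is the most popular choice for this item.
mode_rating, mode_count = counts.most_common(1)[0]
print(f"mode = {mode_rating} (chosen by {mode_count} of {n})")
```

Nothing in this summary assumes that the distance between adjacent scale points is equal, which is why it remains safe when the equal-interval assumption is in doubt.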

Let's consider another example. Interlanguage researchers (researchers interested
in the emerging language system of the second language learner) often use
counts to show how frequently a particular grammar or phonological structure is
used correctly by learners. They count the number of times the learner uses the
structure correctly and the number of times it is incorrect (or omitted). This
yields a "percent correct" figure for each structure. With such figures from many
learners, researchers can make claims about difficulty of particular structures.
Do you believe the scores have received interval measurement? Certainly
percentage intervals are equally spaced. Many researchers would agree that such
conversions do result in continuous interval data and that a "mean percent
correct" is an appropriate description of central tendency for the data. We would
argue that the data began as open-ended frequencies which, even though
converted, still may not be continuous. The distribution of the data may not
approach a normal distribution and so the mean and standard deviation are not
appropriate statistics in such cases.
If you agree with us, would you, nevertheless, feel comfortable about
rank-ordering the structures in relation to each other? That is, you might not want to
say that a structure with a "score" of 88 was ten points higher than a structure 
with a "score" of 78, but you might feel justified in ranking one above the other 
without specifying the interval between them. The data would be ordinal, and 
the median would be used as the best measure of central tendency. Whichever 
choice you make, you should be prepared to justify your decision. This decision 
is important, for it affects the type of statistical procedure you will use in testing 
hypotheses. 
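
Converting counts to "percent correct" and then keeping only the rank order can be sketched as follows. The structure names and counts are hypothetical, chosen so that two of the figures match the 88 and 78 mentioned above.

```python
# Hypothetical (correct uses, total contexts) per structure; invented data.
structures = {
    "structure A": (44, 50),
    "structure B": (39, 50),
    "structure C": (30, 50),
}

# Percent correct for each structure.
percent_correct = {
    name: 100 * correct / total for name, (correct, total) in structures.items()
}

# Rank the structures from most to least accurate, discarding the intervals.
ranked = sorted(percent_correct, key=percent_correct.get, reverse=True)
for rank, name in enumerate(ranked, start=1):
    print(f"rank {rank}: {name} ({percent_correct[name]:.0f}% correct)")
```

The ranks say only that one structure was supplied correctly more often than another; they make no claim that the ten-point gap between the first two means the same thing as any other ten-point gap.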

Statisticians most often argue in favor of treating rank-order data as though it 
were interval data. They see ordinal scales as being continuous (with the mean 
as the appropriate measure of central tendency). Many argue that statistically it
makes no real difference and that it is best to treat ordinal and interval as the
same. In many cases we believe it is better to consider them as ordinal rather
than as interval data, and especially so if the data are not normally distributed.
We will argue, and this argument will continue in the succeeding chapters, that
the researcher should make an informed decision, informed by the true nature of
the data, and that the decision should be justified in the research report. This
decision is crucial because it has important consequences for selecting an
appropriate statistical test for the data.


ooooooooooooooooooooooooooooooooooooo 

Practice 6.6 

1. Assume that you administered a 30-item questionnaire to teacher trainees as
a replication of the El-Naggar and Heasley study. Would you feel comfortable
summing each student's responses on the 30 items and treating this as an
"attitude" score, similar to a test score? If not, why not? _


2. Our university, at the end of each term, requires students to evaluate courses 
and teachers. Questions such as "How would you rate the overall performance 
of your instructor?" are followed by a 9-point scale. Do you believe this 9-point 
scale does, in fact, represent a more continuous measure than shown in the pre¬ 
vious example? If so, why?_ 


ooooooooooooooooooooooooooooooooooooo 

Conclusion 

In all research reports where rank-order or interval data have been collected, it 
is important to display a measure of central tendency (mode, median, or mean)
and a measure of variability (range, standard deviation, or variance). While we
always want to know what the most typical score might be, the distribution of
scores around that typical score is perhaps even more important. Why this is so
will be illustrated in the next chapter. Before we turn to that issue, let's consider
again why researchers select one or another measure of central tendency and
when one might use one measure of variability rather than another.

If you review the section on measures of central tendency, you will remember that
the mean score is very sensitive to extreme scores. If these extreme scores
represent "outliers" that can clearly be shown to "not belong" to the data, it is possible
(when there is strong justification) to remove them from the data and follow them
as separate case studies. If such scores seriously skew the data, the distribution 
will not be normal. If the data do not approach a normal distribution, this will 
limit the types of statistical procedures that can be used in testing hypotheses. 

Some researchers consider data removal a questionable procedure for there is no 
established methodology or set guidelines for deciding what constitutes an outlier 
or when outliers can safely be deleted. Of course when there is no clear expla¬ 
nation for outlier performance, they must remain in the data and the researcher 
would be wise to opt for the median as the best measure of central tendency. For 
example, if you work with census tract data (where data from census counts are 
displayed within mapped geographic areas) and find that some families report 
incomes that are extremely high or low compared to that of other respondents, 
there is no legitimate way of removing these families from the data sample.
Therefore, the median is usually used for such data.
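
The sensitivity of the mean to extreme scores, and the stability of the median, can be seen in a short sketch. The family incomes are hypothetical, invented for illustration; the last value plays the role of an extreme report.

```python
from statistics import mean, median

# Hypothetical annual family incomes; the last value is an extreme score.
incomes = [28_000, 31_000, 33_000, 35_000, 36_000, 38_000, 950_000]

print(f"mean   = {mean(incomes):,.0f}")
print(f"median = {median(incomes):,.0f}")

# The same tract without the extreme family, shown only for comparison.
typical = incomes[:-1]
print(f"mean without the extreme score = {mean(typical):,.0f}")
```

One extreme family pulls the mean far above what any typical family earns, while the median stays with the bulk of the data.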

The mode or the median are often used in reports where data are drawn from
beginning language learners, from preschool children, or from persons with lan¬ 
guage disorders. Performance in these cases may vary considerably. When this 
is so, the mode may be the most appropriate measure, but if there are only a few 
extreme scores, the median should be used. 

If the data come from rank-order scales rather than true interval measurement, 
you should discuss the options available to you with an advisor or statistical 
consultant. If the rank scales are very interval-like and the distribution of the 
data appears to be normal, you should feel comfortable in reporting the data in 
terms of X̄ and s.d. If they are not, you may present the information in terms of
the proportion or number of persons who placed at each point on the scale. You
might also give the mode, or the median, and the range (depending on the focus
of the study). In the first instance, you would use the X̄ and s or variance in
testing the hypotheses. In the second, you might opt for the median. The decision 
is yours, but it is a decision which will have important consequences. These 
consequences will become clear in subsequent chapters of this volume. 


Activities 

1. Our university, worried about the legality of asking immigrant students, like
foreign students, to take an ESL English language proficiency test, thought they
might use a combination of SAT scores and length of residence to locate students
in need of special ESL instruction. To check this, we found 69 students for whom
we had ESL test scores, SAT scores, L1 background, and length of residence
information. The following table is taken from the report.


VARIABLE        N     MEAN       STD DEV
ESL Test        69    109.043    20.217
SAT Verbal      69    307.101    76.660
SAT Math        69    540.970    105.923
Language        69    6.623      2.000
Months in US    69    61.723     39.118
SAT Tot         69    847.971    147.754


Interpret the table in the following terms. (1) Is it appropriate to display the
means and standard deviations for each of these variables? If not, why? (2) If
this were your study and you wished to display this information, would you
arrange the information in the same way? (3) What sort of a visual display might
you use for this information?

2. A. Spector-Leech (1986. Perceptions of functional and cultural difficulties by
Japanese executives in Southern California. Unpublished master's thesis, TESL,
UCLA.) administered a questionnaire to 411 Japanese business managers living
in the United States. A table presents the following information on their ages:

            Age
X̄           45.5 yrs
s           7.2 yrs
Youngest    26.0 years
Oldest      63.0 years

Do you imagine that this is a random sample of Southern California-based 
Japanese business managers? What is the age range? With an n of 411 can you 
assume normal distribution? If so, what is the mode? The median? Draw a 
normal distribution polygon. Using the value of the standard deviation, label the
ages for 1, 2, 3 standard deviations above and below the mean. According to this
curve, are some of the managers predicted to fall outside the age range actually
listed? What proportion of the managers are predicted to be between 52.7 years
and 38.3 years? How many people does this proportion represent?

3. D. McGirt (1984. The effect of morphological and syntactic errors on holistic
scores of native and nonnative compositions. Unpublished master's thesis, TESL,
UCLA.) noted that nonnative students and native students alike must receive
undergraduate instruction in composition. Although they may elect to take ESL
composition rather than English composition, students receive the same credit.
Instruction in "parallel" courses is always an issue even though the students in
these two classes may have very different instructional needs. One motivation
behind this study was a desire to know whether students should cover the same
material in the two courses (i.e., less emphasis on grammar and morphology in
the ESL classes). American composition teachers were asked to give holistic
ratings to compositions from these two groups. In one set, the ESL compositions
were typed in their original form and interspersed with those written by native
speakers. In the second condition, the morphology errors in both ESL and native
speakers' compositions were corrected before they were randomly ordered for
ratings. The ratings were holistic (ranging from 1.5 to 15.5). Interpret the
following chart:


Group                   X̄        s
Native speakers
  With errors           9.21     2.12
  Without errors        10.10    2.13
Nonnative students
  With errors           5.80     2.21
  Without errors        8.56     1.66


In which group of compositions was there the greatest agreement among ratings?
In which was there least? Thirty native speakers and 30 nonnative students
enrolled in ESL or English composition served as subjects in this study. Do you feel
the Ss probably are representative of students normally enrolled in these classes?
If not, why? The figure shows the scores given the compositions when errors were
not corrected. Comment on the distribution of scores.



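[Figure: distribution of scores given the compositions when errors were not corrected; horizontal axis: Score]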


4. J. Phillips (1984. A comparison of the writing performance of immigrant and 
foreign ESL students at UCLA. Unpublished master's thesis, TESL, UCLA.) 
carried out an error analysis as one part of his thesis. The mean length of com¬ 
positions for the two groups were as follows: 

        Immigrant          Foreign
X̄       274.542 (X̄_I)      240.143 (X̄_F)
s       76.713 (s_I)       80.123 (s_F)

Those subscripts I and F look pretty spiffy, don't they! What do they represent?
Which group appears to write longer compositions? In which group is there more
variability? If you add 1 standard deviation to the mean of the immigrant group,
what is the total? If you add 1 standard deviation to the mean of the foreign
student group, what is the total? How different do the two groups appear at this
point? At 2 standard deviations from the mean? Assume for the moment that
the data are normally distributed and chart the two means and standard
deviations onto an overlapping polygon. What does the distribution look like?

5. T. Mannon (1986. Teacher talk: a comparison of a teacher's speech to native 
and nonnative speakers. Unpublished master's thesis, TESL, UCLA.) counted 
the frequency of a number of structures a teacher used in her content lectures to
native speaker students and in her lectures to ESL students. The talk data were
entered as T-units (roughly equivalent to an independent clause). The forms
(e.g., number of tag questions, imperatives, referential questions) were then
counted per T-unit. The teacher, thus, would have a total number of T-units in
a lecture. Some of these would, for example, include tag questions. A percentage
figure for tag questions could thus be obtained. A series of charts, similar to the 
following, were presented for each structure. 

                 X̄ tag Qs    s.d.
Teacher → NS     .65%        1.29%
Teacher → NNS    .91%        2.02%

Did the teacher use tag questions more or less often than once in 100 T-units?
Using the mean and standard deviation figures, attempt to draw a distribution
polygon for these data. What happened? Consult others in your study group to 
see what they think about the data. What conclusion can you reach? 

6. R. Hinrichs (1985. WANDAH: A critical evaluation of objectives and uses
among developers, teachers, and students. Unpublished master's thesis, TESL,
UCLA.) asked 128 freshmen to rate a writing software package. The figure
below displays means and standard deviations for their responses (a 9-point rating
scale on effectiveness of the program for seven functions). The statements for the
seven functions asked if the users found WANDAH saved them time in writing,
helped them revise, helped them organize their writing, made them better writers,
was easy to read on the screen, whether the program was confusing, and whether
they used the package for all their writing. Convert the display to a numerical
table. If this were your study, which would you use for your final report? Why?

[Figure: means and standard deviations of the ratings for the seven functions; axis label: mean]


7. C. Holten (1984. The use of authentic language materials in low-intermediate
ESL classes. Unpublished master's thesis, TESL, UCLA.) as one small part of
her evaluation, asked observers to evaluate materials for three discourse units:
description, narration, and process. Here are tables for their ratings (1 to 5 with
5 being highest).


Description Unit
N = 6 observers
Lesson Presentation
Materials Format
Teachability of Materials
Student Participation
Student Reaction


Narration Unit
N = 7 observers
Lesson Presentation
Materials Format
Teachability of Materials
Student Participation
Student Reaction


Process Unit
N = 5 observers
Lesson Presentation
Materials Format
Teachability of Materials    4.6    .68
Student Participation        3.9    1.1
Student Reaction             4.0    .77


In which ratings did the observers most closely agree? In which did their ratings 
vary the most? Consider for a moment the other options open for displaying 
these data. Would you have preferred to have a frequency chart for the ratings 
(1-5) with the number of observers, who selected each (instead of the mean and 
s.d. for each)? Why or why not? Can you think of any other way you might want 
to display these data? 
Chapter 7

Locating Scores and Finding
Scales in a Distribution


•Locating scores in a distribution
    Percentiles, quartiles, and deciles
•Locating comparable scores in distributions
    Standardized scores and the normal distribution
    z scores and T scores
•Distributions with nominal data
    Implicational scaling (Guttman scalogram)
•Other applications of distribution measures

When describing data (rather than testing our hypotheses) we often want to lo¬ 
cate individual scores in a distribution. This location may be in terms of 
percentile ranks or by placement in levels such as quartiles or deciles. We fre¬ 
quently need to locate individuals in a distribution where information comes from 
several sources rather than just one. In addition, we sometimes want to look at 
a distribution to discover whether or not a single scale can be discovered within 
which individual Ss or observations can be ranked. All of these issues relate to
distribution of data and the location of individual scores within that distribution. 


Locating Scores in a Distribution 


Percentiles, Quartiles, and Deciles 

If you have ever taken an ETS test, the TOEFL, the Miller Analogies Test, or the 
Graduate Record Exam, you probably received a card in the mail with your score 
and the percentile level based on scores of Ss who took the test previously and
established norms for students from your major (education, engineering, English,
etc.). The report doesn't always say how well you did compared with students
who took the test at the same time that you did, but rather how well you did in
comparison with other people from your area of expertise who have taken the
test. (You should always check the fine print to find out exactly how your 
percentile rank has been set.) 

When we compute a percentile rank, we locate an individual score in a distri¬ 
bution. Basically, percentiles locate an individual score by showing what percent 
of the scores are below it. Actually, the measurement isn't just "below" the score,
but rather below the score plus half of those that received the same mark. So
we usually say the percentile shows the number of scores "at or below" the
individual score in the distribution. If you receive a card that reports you as being
at the 92nd percentile, this means that you did as well as or better than 92% of people
tested on the exam. For example, the TOEFL publishes the following informa¬ 
tion so that examinees can discover how well they compare with others taking the 
test: 

TOEFL SCORE COMPARISON TABLE
(based on the scores of 759,768 examinees
who took the test from July 1985 through June 1987)

      TOTAL                        SECTION SCORES
                       Sec 1            Sec 2            Sec 3
  Your    %ile     Your    %ile     Your    %ile     Your    %ile
  Score   lower    Score   lower    Score   lower    Score   lower
  660     99       66      98       66      97       66      98
  640     97       64      95       64      94       64      96
  620     93       62      90       62      90       62      92
  600     89       60      85       60      85       60      87
  580     83       58      78       58      77       58      80
  560     74       56      71       56      69       56      71
  540     64       54      62       54      59       54      61
  520     52       52      52       52      50       52      50
  500     41       50      41       50      40       50      39
  480     29       48      30       48      31       48      29
  460     20       46      20       46      22       46      22
  440     13       44      12       44      15       44      15
  420     8        42      7        42      10       42      10
  400     4        40      4        40      7        40      6
  380     2        38      2        38      4        38      4
  360     1        36      1        36      2        36      2
  340              34               34      1        34      1
  320              32               32      1        32      1
  300              30               30               30


From the 1988-1989 Bulletin of Information for TOEFL and TSE. Reprinted 
with the permission of the Educational Testing Service, Princeton, NJ. 


ooooooooooooooooooooooooooooooooooooo

Practice 7.1 

1. Use the TOEFL table to answer the following questions.

a. What is the percentile ranking for the following scores of one of your ESL
students? Section 1: 54 _; Section 2: 62 _; Section 3: 42 _.

b. If your student obtained a total score that placed her at the 52nd percentile,
you could place her score as _.

c. Since the percentiles are based on the scores of a set number of examinees, it
is possible to turn these into actual frequencies. How many persons received a
total score of 380 or less? _.

d. If your school requires a TOEFL above 500 for student admission, how many
of these examinees would not have met the admission requirement? _.
How many would have met the requirement? _.

2. If your institution uses the TOEFL to screen applicants, what cutoff point do
they use? _ What percentile point does this cutoff represent? _

ooooooooooooooooooooooooooooooooooooo 

The student's percentile rank for the total test is usually the most important piece
of information (since universities often use a set cutoff point for admitting or
rejecting students). However, the percentiles for the sections also give us valuable
information. For example, the overall percentile for a foreign student and an
immigrant student might be the same but we would predict that immigrant
students (who have had more opportunities to use the oral language) might
outperform foreign students on some sections and perhaps do less well on other sections
of the test. Students from different majors might also have the same percentile
overall but differ in performance on the different sections. This information
would be useful for placement purposes and for curriculum design.

To compute percentiles, we need to know what the distribution of scores looks
like and the place of the individual score in the distribution. Assume you
administered a test to 75 Ss entering your school's ESL program. The distribution
of scores was as follows:



         Frequency    Relative    Cumulative
Score    (f)          freq.       freq. (F)     %ile
50       6            .08         75            96
40       18           .24         69            80
30       27           .36         51            50
20       18           .24         24            20
10       6            .08         6             4


If you do not remember how the column labeled "Cumulative Frequency" (F) was
computed, please review chapter 5, page 133.

The percentile requires that you locate the cumulative frequency at the point of
the individual score. If the score was 40, 18 people received the score. The
cumulative frequency for a score of 40 is 69. Divide this by the total N (which is
75 in this case). Then to make it a percent figure, multiply by 100. The result
(92%) shows what percent of the Ss who took the test scored at or below 40. This
figure is often reported as the "percentile." However, a more precise definition
of percentile tells us how a particular S scored in terms of "as well as or better
than" other Ss in the distribution. Percentile locates the score relative to the
proportion of scores at or below it in the distribution. The F used, then, includes
all the Ss who scored below and half the Ss whose scores place them at this
particular point. Thus, the formula for percentile is:

\[
\text{Percentile} = (100)\,\frac{\text{no. below} + \tfrac{1}{2}\,\text{same}}{N}
= (100)\,\frac{51 + \tfrac{1}{2}(18)}{75}
= 80
\]

This formula will work to help you locate any score in a distribution. It locates
the score by placing it relative to the proportion of all scores at or below it. Of
course, you can easily locate the score relative to the number at or above it simply 
by subtracting from 100. If you took a test, scored 84 and that placed you at the 
87th percentile, then 13% of the scores were at or higher than 84. 
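
The two calculations just described, the simple "at or below" percent and the more precise percentile, can be sketched from the frequency table in the text. The function names `cumulative_frequency` and `percentile_rank` are invented for this illustration.

```python
# Frequency table from the text: score -> number of Ss with that score.
frequencies = {50: 6, 40: 18, 30: 27, 20: 18, 10: 6}


def cumulative_frequency(freqs: dict, score: int) -> int:
    """Number of Ss who scored at or below the given score (F)."""
    return sum(f for s, f in freqs.items() if s <= score)


def percentile_rank(freqs: dict, score: int) -> float:
    """(100) * (no. below + 1/2 same) / N, as in the formula above."""
    n = sum(freqs.values())
    below = sum(f for s, f in freqs.items() if s < score)
    same = freqs.get(score, 0)
    return 100 * (below + same / 2) / n


n_total = sum(frequencies.values())
for score in sorted(frequencies, reverse=True):
    big_f = cumulative_frequency(frequencies, score)
    print(
        f"score {score}: F = {big_f}, "
        f"at or below = {100 * big_f / n_total:.0f}%, "
        f"percentile = {percentile_rank(frequencies, score):.0f}"
    )
```

For a score of 40, the loop reproduces both figures from the text: 92% of the Ss scored at or below 40, while the percentile, which counts only half of the Ss at that score, is 80.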

In school research, you may find reports of quartiles and deciles as well as
percentiles. Quartiles and deciles are another way of locating individual scores
within a distribution. The first quartile is the value where 25 percent
(one-quarter) fall below the score. Seventy-five percent of the scores fall above the
first quartile. However, if you say that a student's score is in the lowest quarter 
of the scores, you locate it anywhere in the first quartile. To say that a student 
scored at the 25th percentile is not the same as saying the student scored in the 
first quartile. 

Deciles locate individual scores according to tenths of the cumulative rank. A
score at the first decile is located so that 10% fall below and 90% above the score.
The second decile is at the point where 20% fall below and 80% above the score.
Again, to say an individual places at the eighth decile is not the same thing as in
the eighth decile. The eighth decile locates a score so that 80% fall below and
20% above. At least 79% of all scores fall below those at the eighth decile. A
score in the eighth decile would be within the range between the eighth and ninth 
decile. 
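
Quartile and decile cut points can be found for the 75-score distribution used earlier. This is only a sketch: it relies on the default interpolation method of Python's `statistics.quantiles`, which is one of several conventions for placing cut points, and other software may report slightly different values.

```python
from statistics import quantiles

# Frequency table from the text, expanded into 75 individual scores.
frequencies = {10: 6, 20: 18, 30: 27, 40: 18, 50: 6}
scores = sorted(s for s, f in frequencies.items() for _ in range(f))

quartile_points = quantiles(scores, n=4)   # three cut points: Q1, Q2, Q3
decile_points = quantiles(scores, n=10)    # nine cut points: D1 ... D9

print("quartile cut points:", quartile_points)
print("decile cut points:  ", decile_points)
```

With so few distinct score values, several neighboring cut points land on the same score, which is a reminder that quartiles and deciles are coarse ways of locating a score.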


Percentiles, quartiles, and deciles are similar in that they all locate an individual 
score relative to the number of scores below it. Scores of individuals are usually
located in terms of percentiles. Deciles and quartiles are more commonly used to 
locate schools in a distribution. For example, schools may be required to give a 
statewide exam. For a specific grade level, the school achieves an average exam 
score based on the performance of its students. These averages are then used to 
place the school in decile or quartile divisions for all schools in the state. 


Percentiles, quartiles, and deciles leave some unanswered questions in regard to
locating scores in a distribution. For example, assume you want to compare two 
applicants for admission to your school. Each was the most outstanding person 
in her school. The percentile ranks of each at their respective schools might be 
the same: the 99th percentile. Yet, we don't know if one person dwarfed
everyone else at her school while the superiority of the other might have been slight at
a second school. Percentile information on the two students leaves us in the
dark. We need another way of locating a score or value in a distribution that is
sensitive to such differences. This second method is to use standardized scores.


ooooooooooooooooooooooooooooooooooooo

Practice 7.2

1. Fill in the figures and compute the percentile rank for a student who scored
on the exam on page 189.


\[
\text{Percentile} = (100)\,\frac{\text{no. below} + \tfrac{1}{2}\,\text{same}}{N}
\]

\[
\text{Percentile} = (100)\,\underline{\hspace{4em}}
\]


2. In what quartile of the distribution would the student's score be placed?
_. In what decile? _.

3. Which method of placement (percentile, quartile, or decile) gives the most
information? Why? _




4. On the basis of a statewide reading test, your school was placed in the second
quartile of schools statewide. The local paper headlines this as among the poorest
in the state. If the school serves mainly newly arrived immigrant students, what
information might you ask for from the state department of education that would
give you a more precise and appropriate placement for your school? _


ooooooooooooooooooooooooooooooooooooo




Locating Comparable Scores in Distributions


Standardized Scores


In the previous chapter, we discussed ways in which we can report a typical value
or score for data. We also talked about three ways in which we could show how
clustered or dispersed other values or scores might be in the data. We noted that
one method, standard deviation, provides us with a "ruler" that will help us
measure the dispersal of scores from the mean. That ruler can also help us locate
individual scores in relation to the total distribution.

When the distribution of scores is normal, the standard deviation can give us
extremely useful information. One use not mentioned in the last chapter is that of
allowing us to compare scores of individuals where different tests have been used
to measure performance. In the last chapter we asked you to select two classes
to teach, given information on scores and standard deviations. Imagine, however,
that this information came from two different tests: say, the department's ESL
screening exam was used in two classes and a national test of English proficiency
for the others. Scores that come from two different tests (although they purport
to measure the same thing) do not allow us to make the needed comparison.

If the scores form a normal distribution and if you have the mean and standard
deviation for each, there is an easy way to make such comparisons. This involves 
standardized scores. 
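
The comparison described here can be sketched in a few lines. The sketch assumes the conventional standardized (z) score, that is, the distance of a score from its group mean expressed in standard deviation units. The test names, scores, means, and standard deviations are hypothetical values invented for illustration.

```python
def z_score(score: float, mean: float, sd: float) -> float:
    """Distance of a score from the mean, in standard deviation units."""
    return (score - mean) / sd


# Hypothetical student A: department screening exam (mean 70, s.d. 10).
z_department = z_score(80, mean=70, sd=10)

# Hypothetical student B: national proficiency test (mean 500, s.d. 40).
z_national = z_score(520, mean=500, sd=40)

print(f"student A: z = {z_department:+.2f}")
print(f"student B: z = {z_national:+.2f}")
```

The raw scores of 80 and 520 cannot be compared directly, but once each is expressed on its own test's "ruler," student A stands farther above the mean than student B.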


Standardized Scores and the Normal Distribution 

Before continuing, let's review the concept of normal distribution, for it is basic 
to the notion of standardized scores (and to many other procedures). 

All of us like to think that we are unique, different from everyone else in what
we can do. At the same time, we think of ourselves as being like all other humans
in our capacities. In fact, the outcome of any sound measurement of human
activity or behavior approximates a normal distribution. No matter what kind of
behavior is measured or the type of measurement employed, the distribution of
data in large samples tends to be normal. True, there will always be some people
who score very high and some people who score very low on whatever we
measure, but most of the data will fall around the point of central tendency and the
dispersion of the data away from that point will be symmetrical. 

How long do you think it would take you to learn 50 new vocabulary items in a 
language you do not now know? How different do you think your time to reach
this criterion would be from that of your friends? Think of the best language
learner you know. How much faster do you think this person might be at this
task? Think of the worst language learner you know. How much longer do you
think this person might take? Given this information, you could roughly estimate
what a normal performance distribution would look like. 

If you actually timed learning of these 50 items, you could begin tabulating the 
time period for each person. As you tested more and more people, you would
find a curve emerging with most people around the middle point of the
distribution (the central value). There would be a smaller number of people who would
do better than the majority of people, as well as a smaller number who would not
do as well. If you began the data collection with students enrolled in a private
foreign language school, the times might be quite rapid and the data might cluster
very tightly around the mean. If you then collected data from young children,
you might find they clustered around a much slower point on the time scale,
producing a bimodal distribution (one normal distribution for adults and one for
children forming a bimodal curve).



If you tested more and more learners of all ages, these differences in the
distribution would gradually become incorporated into a new overall curve and, again,
most of the scores would cluster around one central value.





No matter what kind of behavior is measured, when large amounts of data are
gathered, performance will approach this normal distribution. If a large number
of students already enrolled in a graduate program are required to "pass" the ETS
test in some foreign language (say French) for a Ph.D., you might imagine (given
their sophistication in test-taking after this many years in school and their
motivation) their scores would be high. This is true. Nevertheless, given the large
number of Ss in this group who would take the test, their scores would approach
a normal distribution. The scores would spread out symmetrically on either side
of the mean. We would expect that the standard deviation would be small. That
is, there would not likely be a wide dispersion of scores away from the mean.
Rather, we would expect them to cluster fairly tightly around the high mean
score.
Now imagine what the distribution on the ETS French test might be if it were
administered to all sorts of people from fifth grade on to students studying French
in retirement communities. With a large number of scores, once again the
distribution would approach the normal curve. The dispersion of scores would be wide
(the standard deviation large) as the scores spread symmetrically away from the
mean.

In each case, assuming a fairly large data sample has been collected, the
distribution will tend to be normal. However, the distribution represents what is
normal for the group from which that sample is drawn.

Given the fact that all of us share human abilities, it seems logical that human
performance should approach a normal distribution. We do, of course, differ in
how well or how fast we can do various tasks, but our behavior will fit in the
normal range of behaviors. All of this seems logical enough. Yet, in fact, the
normal distribution does not actually exist. We never get a completely normal
data distribution. Normal distribution is an idealized concept. Still, the more
data we collect, the closer we get to the normal distribution. So, we need a great
deal of data if we hope to say much about general human language learning
abilities. That is, we can't rely on the data from students enrolled in a private
foreign language school to give us a normal distribution for all learners. The Ss
are not randomly selected as representative of all learners. In addition, the
sample is not large enough. We can't rely on the distribution of 100 Ph.D. candidates
on the ETS test of French to reflect that of all learners of French. The scores
will be very close to a normal distribution, but obviously the distribution won't
be representative of all learners.

The normal distribution has three important characteristics: 

1. The mean, median, and mode in a normal distribution are all the same. 

2. The first property results in the second characteristic—the distribution is bell 
shaped and symmetric. 

3. The normal distribution has no zero score; the tails never meet the straight 
line but stretch to infinity in both directions. 
You know that half the scores fall above and half below the median. And since
the mean and the median are the same in a normal distribution, the same can be
said regarding the mean. Between the mean (X̄) plus or minus one standard
deviation (± 1s), we can expect to find 68% (more precisely, 68.26%) of the
observations. Between X̄ and ± 2s, 95% (more precisely, 95.44%) of the observations
are accounted for. And between X̄ and ± 3s, 99% (more precisely, 99.72%) of
the data are entered. This leaves only 1% of the data to fall in the space shown
under the outer stretch of the tails.
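
The proportions just quoted can be checked directly from the normal curve. What follows is a minimal Python sketch, not part of the original discussion: the function name `proportion_within` is our own, and it relies on the standard identity that the area within k standard deviations of the mean equals erf(k/√2). Computed this way the figures come out a hair above the table-based values given above (68.27%, 95.45%, and 99.73%).

```python
import math


def proportion_within(k: float) -> float:
    """Proportion of a normal distribution lying within k standard
    deviations on either side of the mean."""
    return math.erf(k / math.sqrt(2))


# Print the share of observations within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} s of the mean: {proportion_within(k) * 100:.2f}%")
```

Whatever falls outside 3 standard deviations (well under 1% of the data) lies under the outer stretch of the two tails.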



These proportions show the normal distribution of data for any behavior. This 
does not mean that all normal distributions look exactly like that shown in the 
above curve. Each of the following curves shows a normal distribution. We can
assume that in each case the mean, median, and mode are the same. None has
a zero value and the tails never meet the straight line. The distribution is
bell-shaped as the distribution of the scores from the mean spreads out symmetrically
on either side of the mean. They differ in how widely the scores are dispersed
from the mean. In some cases, the scores cluster around the mean. The
standard deviation is small. In other cases, the scores spread out from the mean
so that the distribution is rather flat. The standard deviation is large.
If we have a normal distribution, we can locate scores in relation to each other
when we know the mean and standard deviation. For example, if we know that
the mean for our ESL placement exam is 500 and the standard deviation is 50,
we can locate the scores of individual students in relation to that distribution. If
a student scored 550 (1 s above the mean), we know that the score is higher than
84% of the scores. To see that this is the case, you need only look back at the
bell-shaped curve and the percentage figures displayed there.

If another student scored 400, the score is two standard deviations below the 
mean. From the normal distribution, we know that less than 3% of all scores 
will fall below this point. If a student scored 3 standard deviations above the 
mean, a score of 650, less than 1% of the scores would be above that point, and 
99% would be below. 
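
As a rough check on these three placements, here is a short Python sketch. It assumes the placement scores are exactly normal with the mean of 500 and standard deviation of 50 given above, and it uses `statistics.NormalDist` from the Python standard library (3.8 or later); the helper name `percent_below` is our own.

```python
from statistics import NormalDist

# Idealized placement-exam distribution from the example (mean 500, s 50).
placement = NormalDist(mu=500, sigma=50)


def percent_below(score: float) -> float:
    """Percent of scores expected to fall below `score`,
    assuming the distribution is exactly normal."""
    return placement.cdf(score) * 100


# The three students discussed in the text.
for score in (550, 400, 650):
    print(f"score {score}: about {percent_below(score):.1f}% of scores fall below")
```

With real data the distribution is never perfectly normal, so figures like these are estimates rather than exact counts.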

We have already shown, in talking about the normal distribution, that
information on the mean and standard deviation--coupled with the information on
characteristics of normal distribution--allows us to locate individual scores in a
distribution. When the distribution is normal, the characteristics of normal
distribution also allow us to compute percentile ranks for all the observations in
the data. However, to show what happens when the distribution is not normal,
we will need to turn to the notion of z and T scores. First, though, review normal
distribution by completing this practice.


ooooooooooooooooooooooooooooooooooooo

Practice 7.3 

1. To conceptualize the normal distribution, consider how long you would be able 
to retain the meanings and forms of the 50 new words mentioned on page 192. 
In terms of retention, for how long a period do you think other people might be 
able to remember all fifty items?_ 


What do you imagine the mean length of time might be? 
How wide a range of time would be shown?


Do you think the distribution would be normal? Why (not)? 


Would the distribution curve be flat (large standard deviation) or peaked (small 
standard deviation)?_ 


2. Imagine that you had the task of determining the cutoff point for fulfilling the 
language requirement for your Ph.D. program. ETS has sent you the scores of 
all your students on the exam and the mean and standard deviation for 3,000 
humanities graduate students who took the test. How would you decide on the 
cutoff point? (Would you set it, say, 2 s.d. below the mean, assuming the scores
are all incredibly high to begin with so that anyone attaining such a score could
reasonably be expected to meet the requirement? Would you set it at 1 s.d. above
the mean, assuming that your Ph.D. students should be somewhere "above
average"?) Justify your choice. _


3. Since we know that 68% of all scores in a normal distribution should fall
within 1 s above and below the mean, 95% within 2 s above and below the mean,
and 99% within 3 s above and below the mean, you can estimate where 1 s, 2 s,
and 3 s occur in each distribution on page 195. Mark these estimates on the four
polygons. In which is the standard deviation largest? _ In which
is it smallest? _ Do means and standard deviations have fixed
values in all distributions? Why (not)? _


If you're not sure, discuss this question in your study group and report your 
consensus below. 


ooooooooooooooooooooooooooooooooooooo 

Z Scores 

One of the most common ways of locating an individual score in relation to the
distribution is to use a z score. To find it, we use standard deviation. Think, once
again, of standard deviation as a ruler. Instead of being 12 inches long, it is 1
standard deviation in length. The z score just tells us how many standard
deviations above or below the mean any score or observation might be. So, first we
look to see how far the score is from the mean (X - X̄). Then we divide this
individual deviation by the standard deviation. This allows us to see how many
ruler lengths--how many standard deviations--the score is from the mean. This
is the z score.

z = (X - X̄) / s


By finding the z score, we know where the score actually falls in the total
distribution. For example, if a student scored 600 on the ESL placement test (X̄ =
500, s = 50), then the z score would be +2.


The percentile for this score, assuming a normal distribution, is 98 (actually
97.72). If the z score were -2, the percentile score would be 2 (actually 2.28).

Of course, scores are not always exactly 1, 2, or 3 standard deviations away from
the mean. Look at the following diagram. We gave a test to six people. Their
scores are shown on the diagram below.



The mean, where the balance point is, was 50. The standard deviation turned
out to be 25. The little ruler represents one standard deviation (1 z score in
length, right?). Congratulations! You are the person who scored 90 on this test.
Your z score is 1.6. One of the authors of this book is the person who scored 20.
Her z score is -1.2.
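
The z score calculation is simple enough to sketch in a few lines of Python. The function names below are our own, and the percentile conversion holds only under the assumption that the scores are normally distributed; it uses `statistics.NormalDist` from the standard library.

```python
from statistics import NormalDist


def z_score(raw: float, mean: float, sd: float) -> float:
    """Number of standard deviations a raw score lies above (+)
    or below (-) the mean."""
    return (raw - mean) / sd


def percentile_if_normal(z: float) -> float:
    """Percent of scores at or below z, valid only when the
    distribution is normal."""
    return NormalDist().cdf(z) * 100


# ESL placement example: mean 500, s 50, raw score 600.
z = z_score(600, 500, 50)
print(z, round(percentile_if_normal(z), 2))

# Six-person test: mean 50, s 25, raw scores 90 and 20.
print(z_score(90, 50, 25), z_score(20, 50, 25))
```

Note that `z_score` needs only the mean and standard deviation, while `percentile_if_normal` adds a further assumption about the shape of the distribution.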
To see how percentiles and z scores are linked in a normal distribution, examine
the following chart which gives a more precise breakdown of the proportions of
data in each segment of the curve in a normal distribution.

z score       -3     -2     -1      0     +1     +2     +3
Percentile    0.1     2     16     50     84     98   99.9

From this display, you might think that the percentile information is more
important than z scores, or that the two are really the same thing. Neither is really
true.

It is true that nobody would ever want to know that their individual z score was
-2.1 or, worse yet, -1.4. Percentiles give information that makes more sense to
most test-takers. However, z scores are useful in many ways. Imagine that you
conducted an in-service course for ESL teachers. To receive university credit for
the course, the teachers must take examinations--in this case, a midterm and a
final. The midterm was a multiple-choice test of 50 items and the final exam
presented teachers with 10 problem situations to solve. Sue, like most teachers,
was a whiz at taking multiple-choice exams, but bombed out on the
problem-solving final exam. She received a 48 on the midterm and a 1 on the final. Becky
didn't do so well on the midterm. She kept thinking of exceptions to answers on
the multiple-choice exam. Her score was 39. However, she really did shine on
the final, scoring a 10. Since you expect students to do well on both exams, you
reason that Becky has done a creditable job on each and Sue has not. Becky gets
the higher grade. Yet, if you add the points together, Sue has 49 and Becky has
49. The question is whether the points are really equal.

Should Sue also do this bit of arithmetic, she might come to your office to
complain of the injustice of it all. How will you show her that the value of each point
on the two tests is different? It's z score information to the rescue! By converting
the scores to z scores, it is possible to obtain equal units of measurement for the
two tests even though the original units were quite different. That is, Sue picked
up points on an easy multiple-choice test while Becky earned points when the
earning was hard. As a general rule, then, when you want to compare
performance over two tests which have different units of measurement, it's important
to convert the scores to z scores.
Given z scores, it is easy to convert them back again to raw scores. For example,
if the mean on the midterm was 40 and the standard deviation was 5 and a
student's z score was -1, then


X - X̄ = (s)(z)

X = (s)(z) + X̄
X = (5)(-1) + 40
X = 35
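
The same conversion can be written as a one-line function. This is only an illustrative Python sketch; the function name `raw_from_z` and the constant names are our own, and the values are the midterm mean of 40 and standard deviation of 5 used in the example.

```python
def raw_from_z(z: float, mean: float, sd: float) -> float:
    """Convert a z score back to a raw score: X = (s)(z) + mean."""
    return sd * z + mean


# Midterm figures from the worked example above.
MIDTERM_MEAN = 40
MIDTERM_SD = 5

raw = raw_from_z(-1, MIDTERM_MEAN, MIDTERM_SD)
print(raw)  # 35

# Round trip: converting the raw score back gives the original z score.
print((raw - MIDTERM_MEAN) / MIDTERM_SD)  # -1.0
```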

z scores are useful in locating individual scores in a distribution. They are also
useful when decisions must be made and data come from tests with different units
of measurement. Percentiles and z scores both locate scores in a distribution but
they do it in slightly different ways.

z scores, thus, are as useful as percentiles in locating individual scores in a
distribution. Let's be sure we have dispelled the second notion as well--the notion that
percentiles and z scores are basically the same thing. The chart on page 199 may
be misleading, for it appears as though percentiles and z scores have an inherent
equivalence. This is not the case. It is possible that identical z scores in two
different distributions could have very different percentile ranks. You can see
this is the case if you think of a distribution which is not normal. Imagine that
five people took a test and they scored 2, 3, 6, 8, and 11. The mean is 6. Two
people scored below the mean. They have negative z scores. Two scored above
the mean. These z scores are positive. One person scored right on the balance
point, 6. Here are the computed z scores and percentile ranks:


Score    z score    %ile

  2       -1.09      10
  3       -0.82      30
  6        0.00      50
  8       +0.54      70
 11       +1.36      90


If we locate each score (draw a small "brick" as a weight) on a distribution plank,
it looks like this:

  □  □        □     □        □
  2  3  4  5  6  7  8  9 10 11
              ▲

Now imagine that the scores of the five people were 0, 1, 2, 3, and 14. The mean is
4. Four scores are below the mean. In the previous chart, the mean score was 6
and half the scores were at or below the mean. This is no longer the case, as you
can see in the illustration.


  □  □  □  □                                □
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
              ▲

Here are the computed z scores and percentile ranks for these five scores. 


Score    z score    %ile

  0       -0.70      10
  1       -0.53      30
  2       -0.35      50
  3       -0.17      70
 14       +1.75      90


Let's try to summarize this (without confusing you more). There is no inherent
equivalence of z scores and percentiles. Each locates a score in the distribution
in a different way. The percentile tells what percent of the scores in the
distribution are at or below it. Does it say anything about its relationship to the mean?
No, it doesn't. If one person receives a percentile rank of 98 and another a
percentile rank of 99, does this necessarily mean that their scores are close to each
other? No, it doesn't. A percentile just locates each score in proportion to the
number of scores below it. The z score locates an individual score in relation to
the mean but, unless the distribution is normal, it does not locate it in terms of
the percentage of scores that fall above or below it. While both measures locate
a score in a distribution, they do so in different ways.
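
To make the contrast concrete, the two five-score data sets can be run through a short Python sketch. Two assumptions are ours rather than the text's: the standard deviation is the sample formula (dividing by N - 1), which is what reproduces the z scores in the tables, and the percentile rank counts the scores below a value plus half of those equal to it, which is what reproduces the 10, 30, 50, 70, 90 pattern. Rounded to two places the fourth z score in the second set comes out as -0.18; the table's -0.17 appears to be truncated rather than rounded.

```python
from statistics import mean, stdev


def z_scores(scores):
    """z score of every observation, using the sample standard deviation."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


def percentile_ranks(scores):
    """Percentile rank of every observation: percent of scores below it,
    plus half of the scores equal to it (an assumed convention)."""
    n = len(scores)
    ranks = []
    for x in scores:
        below = sum(1 for y in scores if y < x)
        equal = sum(1 for y in scores if y == x)
        ranks.append(100 * (below + 0.5 * equal) / n)
    return ranks


for scores in ([2, 3, 6, 8, 11], [0, 1, 2, 3, 14]):
    print("Score  z score  %ile")
    for x, z, p in zip(scores, z_scores(scores), percentile_ranks(scores)):
        print(f"{x:>5}  {z:>+7.2f}  {p:>4.0f}")
    print()
```

The percentile column is identical for the two sets, while the z scores differ sharply: a percentile rank of 70 goes with z = +0.54 in the first set and with a negative z score in the second.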


T Scores 

Another method of locating an individual score in a distribution is the standard
T score. As you have no doubt noticed, z scores may be either positive or
negative numbers. (If you add all the z scores in a distribution, the answer will be
zero.) In addition, they often contain decimal points (a z score might be 1.8
standard deviations from the mean rather than just 1 or 2). This makes for error
in reporting (it's easy to make a typo on decimals and + and - symbols). T
scores seem easier to interpret since they are always positive numbers and contain
no fractions.

First, to get rid of all the fractions, the z score is multiplied by 10. So, a z score
of 1.4 would be 14 at this point. A z score of .3 would be 3. To be sure we get
rid of minus values, the mean (which was 0 for z scores) is changed to 50.

The mean of the T distribution is set at 50 instead of at 0 and the standard
deviation of T scores is 10. To calculate any T score, simply find the z score and
convert it to a T score:

T score = 10(z) + 50 

Since the mean is set at 50, we are sure to end up with a positive value. And by 
multiplying the z score by a set standard deviation of 10 and rounding off, we 
also come up with whole numbers instead of fractions. If the actual z score were 
3.2 on a test, we could multiply by 10 to get the whole number 32. Then, adding 
the mean of 50, we get a T score of 82. 
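
The conversion is a single line of arithmetic, sketched here in Python. The function name `t_score` is our own, and the rounding step uses Python's built-in `round`, which sends exact halves to the nearest even number; the text itself does not say how ties should be rounded.

```python
def t_score(z: float) -> int:
    """Convert a z score to a T score: multiply by 10, add 50,
    and round to a whole number."""
    return round(10 * z + 50)


# z scores mentioned in this chapter, plus the mean (z = 0).
for z in (3.2, 1.4, 0.3, 0.0, -1.2):
    print(f"z = {z:+.1f}  ->  T = {t_score(z)}")
```

A T score could only drop to zero or below for a z score of -5 or lower, which is why T scores are treated as positive in practice.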

This conversion from z to T scores may not seem very important, but when you 
are reporting hundreds of scores, it's easier to convert them than to type in all the 
decimals and plus and minus values. 

Whenever we compare or combine scores which have different units of
measurement, we first convert them to standard scores. The choice of z or T scores
is up to the researcher. If your interest is in language testing, there are other
standardized scores, such as stanines and CEEBs (College Entrance Examination
Boards), which you may encounter. We refer you to Henning (1987) for
information on these types of standardized scores.


ooooooooooooooooooooooooooooooooooooo 

Practice 7.4 

► 1. The X̄ for a reading test was 38 and the standard deviation 6. Find the z
score for each of the five following raw scores:

Score    z score

38       _
39       _
30       _ (Be careful to subtract!)
50       _
25       _

Find the T score for the above scores:

Score    T score

38       _
39       _
30       _
50       _
25       _


2. If you converted all your data to T scores, why would you not ask the
computer to calculate the X̄ and standard deviation? What would the answer be if
you did? _


ooooooooooooooooooooooooooooooooooooo

Distributions with Nominal Data 

So far in this chapter we have noted that school researchers may locate the
performance of individual students (or of individual schools) in a distribution by
using percentile, quartile, or decile ranks. When comparisons need to be made
and information is drawn from tests where different units of measurement have
been used, researchers locate scores in each distribution in terms of standard
scores such as z or T scores. It doesn't make sense, however, to think of
percentiles or z scores for nominal data. That is, if you look at sex as a variable
you can obtain a frequency count of the number of boys and girls in a sample
but you can't locate anyone at the 99th percentile or give anyone a 2.1 z score.

However, there are times when we want to see if a distribution exists within a
whole series of nominal data frequency counts and whether observations (i.e.,
Ss, pieces of text, test items) can be reliably rank-ordered in the distribution.
Again, this relates to normal performance on a set of variables. Think, for
instance, of questionnaires you get in the junk mail or that you read in the Sunday
paper. The questionnaires purport to measure how sensitive you are to
extrasensory experiences, how well adjusted you are, or even how multilingual
you are. The questions (each answered and thus measured as a yes/no category)
might be:

1. Can you say Merry Christmas in another language? 

2. Can you say Good morning in three languages? 

3. Can you order a cup of coffee in four different languages? 

4. Can you bargain for vegetables in five languages? 

5. Can you write a letter to six friends each of whom speaks a different 
language? 

6. Can you translate the Bill of Rights into six languages?
It is likely that almost everyone can say Merry Christmas in another language.
If this were a nominal variable in a research project, most people would score a
1 (yes) rather than a 0 (no). Many people would also score a 1 on the next
dichotomy; they would be able to say Good morning in three languages. The
further we go through the list of questions--each a nominal yes/no variable--fewer
and fewer people would answer yes and more and more 0s would be entered onto
the data sheet. That is, the questions are arranged in a scaled order so that fewer
and fewer people can claim to possess these attributes. When items are arranged
in this way, most of the 1s will appear as a peak at the bottom of the scale and
there will be a gradual decrease in frequency as the attributes are less and less
possible in human performance. In a sense, the distribution curve looks a bit like
half of a normal distribution polygon. There will always be a few people who do
better than others and they are the ones who will win the Sunday paper's contest
to find the "most multilingual person alive today."

This method of finding a scale in a set of dichotomous (yes/no) items is the basis
of the Guttman procedure, often called Implicational scaling in applied linguistics
research. The procedure has proved an extremely useful one for research on
language learning.


Implicational Scaling

Much research in applied linguistics is aimed at trying to discover the orderliness
(or lack thereof) in the language learning process. Of course the learning process
of interest is not that reflected in the Sunday paper questionnaire. Rather,
interlanguage research has as one of its goals documenting the gradual acquisition
of the grammatical, lexical, and phonological features of language by learners
over time. (Discourse features in interlanguage are also amenable to
Implicational scaling. Although we know of no studies doing this, we don't see
in theory why it couldn't be done.) One of several motivations for documenting
the acquisition of these features relates to language teaching. In syllabus design
and materials development, writers assume they know which structures,
vocabulary, and phonology will be easiest to learn and they begin with lessons to teach
this material first. They gradually increase the complexity of the material they
present to learners. But much of this sequencing of teaching materials is based
on intuition of what is easy or difficult. Interlanguage research seeks to document
a learning sequence for learners by examining the language they produce.

Much of this research is observational; often studies report on only two or three 
Ss (and sometimes only one). The data from these observational studies are 
compiled and then compared. If similarities are observed, some sort of universal 
in terms of stages in acquiring various structures can be proposed. Many of the 
studies done in the 1970s on morpheme acquisition fit into this research pattern, 
so let's begin with an example of how the analysis might be done with 
Implicational scaling.


Suppose that you were interested in the order in which English morphemes are
acquired. The data you have might come from observational studies where you
have charted the presence or absence of the morphemes in the speech of learners
over several weeks or months.

Imagine that you were interested in the sequence of appearance of five different
morphemes. For each S you prepared a chart and noted the presence or absence
of each morpheme each week (1 = yes and 0 = no).

                Morphemes
        Difficult ----------> Easy

Week    M1    M2    M3    M4    M5

  5      1     1     1     1     1
  4      0     1     1     1     1
  3      0     0     1     1     1
  2      0     0     0     1     1
  1      0     0     0     0     1


The matrix shows that at the first data-collection session, week 1, the S produced
one of the morphemes (M5) but none of the others. In session two, the same
morpheme was present and another appeared (M4). Magically, week by week
the learner acquired one more morpheme (and continued to use the ones learned
in the previous weeks). In real life, of course, the process is never this neat and
tidy.

Since the morphemes in the data appeared one after the other, it is possible to
hypothesize a scale of difficulty from easy (for those acquired first) to difficult
(for those acquired late). Prior to collecting the data the researcher may have
little reason to suspect which might be easy and which difficult. As data from
more and more learners are gathered, and if they fit the pattern found for the first
learner, the researcher can claim that the variables form a unidimensional scale,
and ultimately even decide to pose a natural-order hypothesis.

The claim that variables can be arranged in a scale order of difficulty can also
be discovered using cross-sectional data. Realizing you haven't sufficient time to
devote to observational research, you may have devised a game--or used the
SLOPE (1975) or Bilingual Syntax Measure (1973)--to elicit language data from
many learners at one point in time. Again, you would chart the presence or
absence of the morphemes in the speech of these learners. The learners will not all
behave in exactly the same way. Some will know and produce all of the
morphemes; others none. Once charted, the data might look like this:

            Morphemes

S      M1    M2    M3    M4    M5

S1      1     1     1     1     1
S2      0     1     1     1     1
S3      0     0     1     1     1
S4      0     0     0     1     1
S5      0     0     0     0     1
From this chart you should be able to see that Implicational scaling works in two
ways. (1) It lets us arrange dichotomous items (yes/no attributes) to show their
difficulty. That is, the more Ss that can perform with a "yes" on the variable, the
easier the behavior must be. (2) It lets us arrange Ss in a performance order.
That is, the more times an individual S can perform with a "yes" on the set of
variables, the higher the S ranks on performance with respect to other Ss. Thus,
implicational scales locate Ss in a distribution and establish a scale for the items
as well.

When a data matrix is neat and tidy as ours have been, it is not difficult to see
how items and learners should be arranged. The arrangement allows us to make
predictions about the performance of other learners. We expect that a new S who
produces morpheme 1 will also be able to produce morphemes 5, 4, 3, and 2.
Given that knowledge, we would know where the learner should be placed on the
matrix--i.e., at the top of the table.
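
The two-way arrangement described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `implicational_order` and the dictionary layout are hypothetical, and the input is a deliberately scrambled set of yes/no entries for five learners and five morphemes. Items are ordered from fewest to most "yes" responses (difficult to easy), and Ss from most to fewest.

```python
def implicational_order(responses):
    """Order items from difficult to easy and Ss from high to low.

    `responses` maps each S to a dict of item -> 1 (yes) or 0 (no).
    """
    items = list(next(iter(responses.values())))
    # An item is easier the more Ss answer "yes" to it.
    item_totals = {i: sum(r[i] for r in responses.values()) for i in items}
    items_sorted = sorted(items, key=lambda i: item_totals[i])
    # An S ranks higher the more items he or she answers "yes" to.
    subjects_sorted = sorted(
        responses, key=lambda s: sum(responses[s].values()), reverse=True
    )
    return items_sorted, subjects_sorted


# Scrambled data sheet (illustrative values, entered in no particular order).
sheet = {
    "S3": {"M4": 1, "M1": 0, "M5": 1, "M2": 0, "M3": 1},
    "S5": {"M4": 0, "M1": 0, "M5": 1, "M2": 0, "M3": 0},
    "S1": {"M4": 1, "M1": 1, "M5": 1, "M2": 1, "M3": 1},
    "S4": {"M4": 1, "M1": 0, "M5": 1, "M2": 0, "M3": 0},
    "S2": {"M4": 1, "M1": 0, "M5": 1, "M2": 1, "M3": 1},
}

items, subjects = implicational_order(sheet)
print("     " + "  ".join(items))
for s in subjects:
    print(s + "   " + "   ".join(str(sheet[s][i]) for i in items))
```

Sorting by totals alone works cleanly only when the totals differ; with tied totals or with errors in the matrix, the arrangement requires judgment as well as arithmetic.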


ooooooooooooooooooooooooooooooooooooo
Practice 7.5 

1. In the second morpheme chart (page 205), which learner produced the fewest
morphemes? _. Which produced the most? _. Which morpheme
appeared to be the easiest? _. The most difficult? _.

2. If a S produced morpheme 3 correctly, what predictions could we make?


ooooooooooooooooooooooooooooooooooooo

Unfortunately, data are never this neat and tidy. Every piece of data does not
fit or match such scaling. For example, if we gave the Bilingual Syntax Measure
to absolute beginners, none of them would probably be able to produce any of the
morphemes. Zeros would fill the matrix, and no order of difficulty and no order
of Ss could be found. Similarly, if we gave the measure to advanced students, the
matrix would be filled with 1s and nothing could be said about the distribution.
Nor could we locate the performance of these individuals relative to each other
since all perform in precisely the same way.

This points out a reservation regarding Implicational scaling that should be
obvious. If all the items are roughly the same difficulty, no scale will be found.
Conversely, if all the Ss are roughly at the same level, no scale in terms of their
position in relation to each other will be discovered.

However, if we do have Ss at differing levels of proficiency (or the same Ss at
longitudinally different points on their learning continuum), we should be able to
scale items for difficulty. However, as we chart the data, it is unlikely that it will
fall into a pattern precisely like those shown above.
Turn back to the Sunday paper's questionnaire on multilingualism. If we
collected answers from six learners for these six questions, it is unlikely that the data
would allow us to form a perfect idealized matrix. As with all performance, there
will be error. Measuring the error is crucial in Implicational scaling. Since we
are trying to discover a scale, the degree to which the data fit the idealized model
depends on the degree of error. Error, here, refers to the number of entries in the
matrix that violate the ideal model. If there were no error, the scaling would be
perfect. Not only could we scale the items for difficulty but we could also
absolutely predict that an individual S will know certain items but not others simply
by his or her position in the matrix.

When the learner knows something we didn't predict she would know or when
she doesn't know something we predict she will know, it's an error. The following
matrix shows two deviations from the ideal model.

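                 Questions
      Difficult ---------------> Easy

      Q6    Q5    Q4    Q3    Q2    Q1

S6   (1) |  1    (0)    1     1     1
S5    0  |  1     1     1     1     1
S4    0     0  |  1     1     1     1
S3    0     0     0  |  1     1     1
S2    0     0     0     0  |  1     1
S1    0     0     0     0     0  |  1

(Parentheses mark the two entries that depart from the ideal model; the vertical
bars trace the dividing line.)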

Notice that S6 answered yes for five questions, including questions 5 and 6,
which are the most difficult. For some reason, however, the S answered no to
question 4. This shouldn't happen. We expect, from the performance of other
Ss lower on the scale, that S6 should have a yes.

Now let's see how the "line" is drawn on the table because it is this line that shows
us where "errors" occur (that is, where Ss missed items they were expected to get
correct or where they got items right but were expected to miss them). Start at
the bottom of the chart. S1 got one question correct, so we draw a vertical line
between the last two columns (between Q1 and Q2). S2 got two questions correct,
so we draw a vertical line between columns 4 and 5 (between Q2 and Q3). S3
had three correct questions and so the vertical line moves over one more column
(between Q3 and Q4). S4 had four correct, so the line moves over another
column (between Q4 and Q5). S5 had five, so the line is now between Q5 and Q6.
S6 also had five and so the line stays between Q5 and Q6. Thus, to locate the line,
count the number of correct responses and place the line to the left of that
number of columns.

Now to find the errors, look to the right of the line. Everything to the right of the 
line should be Is. The 0 for Q4, which is not a 1, is an "error." Next look to the 
left of the line. Everything to the left of the line should be a 0. The 1 on Q6 is 




S6 © 
S5 0 


Difficult 
Q6 Q5 
1 
1 


S4 

S3 


S2 0 
SI o 
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the second error. Thus, there are two errors in the data set. Not only did Sg 
miss an item that should not have been missed (given the pattern in the data) 
but 56 also answered an item correctly where the 5's total score predicts the item' 
would be missed. 
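
The line-drawing rule just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `count_errors` and the list-of-lists layout are assumptions made here, not part of the procedure as published. Rows are Ss, and columns run from the most difficult item (left) to the easiest (right).

```python
def count_errors(matrix):
    """Count Guttman "errors" in a 0/1 matrix.

    Each row is one S; columns are ordered from most difficult (left)
    to easiest (right). For an S with k correct answers, the dividing
    line sits k columns in from the right edge.
    """
    errors = 0
    for row in matrix:
        k = sum(row)                 # number of correct responses
        line = len(row) - k          # index of the first column right of the line
        left, right = row[:line], row[line:]
        errors += sum(1 for cell in left if cell == 1)   # 1s where 0s are expected
        errors += sum(1 for cell in right if cell == 0)  # 0s where 1s are expected
    return errors


# The six-S example discussed in the text (columns Q6 ... Q1).
example = [
    [1, 1, 0, 1, 1, 1],  # S6
    [0, 1, 1, 1, 1, 1],  # S5
    [0, 0, 1, 1, 1, 1],  # S4
    [0, 0, 0, 1, 1, 1],  # S3
    [0, 0, 0, 0, 1, 1],  # S2
    [0, 0, 0, 0, 0, 1],  # S1
]
print(count_errors(example))  # 2
```

Only S6 contributes errors here: one 1 to the left of the line (Q6) and one 0 to the right of it (Q4).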

Some research studies actually display a matrix as evidence to support their
claims that particular structures can be ordered or scaled for difficulty on the
basis of the performance of learners. Since a real data matrix won't match those
given so far, the dividing line between the 0s and 1s in responses won't be as
evident. For further practice, let's try this with a real data set. The following data
were collected by Lazaraton (1988). This is a pilot replication of a study in which
Scarcella (1983) suggested there might be a performance order for how well
learners are able to do Goffman's (1976) universals of communication (such as
conversational openings, closings, turn-taking and so forth). Ss took part in an
oral interview where they were expected to give an opening greeting, perform a
closing, respond to two preclosing moves, give a self-introduction, and respond to
the interviewer's self-introduction. The data were entered for each S on a data
sheet. Notice that the Ss and the "questions" are not yet ordered in any
meaningful way. (That's the hard part!)


               Question
ID     1    2    3    4    5    6
 1     1    1    1    1    1    1
 2     1    1    0    1    1    1
 3     1    0    0    1    1    1
 4     1    0    1    0    1    1
 5     1    0    0    1    1    1
 6     1    0    0    1    1    1
 7     1    0    0    0    1    1
 8     1    1    1    1    1    1
 9     1    1    1    1    1    1
10     1    0    1    0    0    1
11     0    0    0    1    1    1
12     1    0    1    0    0    1
13     1    0    0    1    0    1
14     1    0    0    0    0    1
15     1    0    0    1    1    1
16     1    0    0    1    0    0
17     1    0    1    0    1    1
18     1    0    0    1    1    1
19     1    1    0    1    1    0


Now the hard part: constructing the actual matrix. The Ss must be rank-ordered
so that the S with the lowest number of correct responses is at the bottom of the
list and the S with the most correct responses is at the top of the list.
Simultaneously, the communicative moves must be rank-ordered for difficulty on the
chart (with the most difficult item first and the easiest last). The data are
displayed in the following chart.

      SIntro  RIntro  PreC1  PreC2  Close  Grtg
ID      Q2      Q3     Q4     Q5     Q6     Q1
 1       1       1      1      1      1      1
 8       1       1      1      1      1      1
 9       1       1      1      1      1      1
 2       1       0      1      1      1      1
 3       0       0      1      1      1      1
 4       0       1      0      1      1      1
 6       0       0      1      1      1      1
15       0       0      1      1      1      1
18       0       0      1      1      1      1
17       0       1      0      1      1      1
 5       0       0      1      1      1      1
19       1       0      1      1      0      1
 7       0       0      0      1      1      1
10       0       1      0      0      1      1
11       0       0      1      1      1      0
12       0       1      0      0      1      1
13       0       0      1      0      1      1
14       0       0      0      0      1      1
16       0       0      1      0      0      1

Tot    14/5    12/7   6/13   5/14   2/17   1/18


Notice the row of totals at the bottom of each column. For some reason, these
totals are called marginals. The first number is the total number of 0s and the
second is the total for 1s. Since the tables for the Guttman procedure are tedious
to set up, it is easier to let the computer do the analysis for you. It will set up the
table and print this out as a scalogram. The scalogram will show you precisely
where the "errors" are located. However, to do this by hand, we must repeat the
procedure of drawing in the lines on the chart. The directions are the same as
before.
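
For readers who want to see what "letting the computer set up the table" might involve, here is a minimal Python sketch of the rank-ordering and the marginals. The function name `build_matrix` and the dictionary layout are assumptions made for illustration, and ties are simply left in input order, which is a choice made here rather than a rule from the text. The data are a subset of five Ss (IDs 3, 4, 7, 8, and 14) from the pilot data sheet.

```python
def build_matrix(data):
    """Rank-order a dict {S id: {item: 0 or 1}} for a Guttman matrix.

    Items run from most difficult (fewest 1s) to easiest; Ss run from
    the highest total to the lowest. Ties keep their input order.
    """
    items = list(next(iter(data.values())))
    ones = {item: sum(resp[item] for resp in data.values()) for item in items}
    item_order = sorted(items, key=lambda item: ones[item])
    s_order = sorted(data, key=lambda s: sum(data[s].values()), reverse=True)
    rows = [[data[s][item] for item in item_order] for s in s_order]
    # Marginals: (number of 0s, number of 1s) for each column.
    marginals = [(len(data) - ones[item], ones[item]) for item in item_order]
    return s_order, item_order, rows, marginals


items = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]
raw = {
    3: [1, 0, 0, 1, 1, 1],
    4: [1, 0, 1, 0, 1, 1],
    7: [1, 0, 0, 0, 1, 1],
    8: [1, 1, 1, 1, 1, 1],
    14: [1, 0, 0, 0, 0, 1],
}
data = {s: dict(zip(items, scores)) for s, scores in raw.items()}

s_order, item_order, rows, marginals = build_matrix(data)
print(item_order)
for s, row in zip(s_order, rows):
    print(s, row)
print(marginals)
```

With only five Ss the item order need not match the order found for all 19 Ss; the point is the mechanics of sorting rows and columns and tallying the 0s and 1s.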


ooooooooooooooooooooooooooooooooooooo 

Practice 7.6 

► 1. To draw the dividing line on the chart, it is easiest to start at the bottom of
the chart. How many people scored 2 (that is, got two items correct)? ____. So
we draw a vertical line 2 columns from the right (between PreC2 and Close) for
those Ss. How many people have three correct items (remember it doesn't matter
at this point which items they got correct, just how many items are correct for each
S)? ____. Continue the vertical line between columns 3 and 4 for these Ss.
How many have four correct items? ____. Draw a vertical line four columns
from the right (between RIntro and PreC1) for these Ss. How many got 5
correct? ____. Move the line over an additional column for this S. How many got
all six items correct? ____. The vertical line is now six columns from the right
for these Ss.

► 2. Now look to the right of the lines. Circle everything that is not a 1. These
are "errors." Ss got items wrong which they were expected to know. Look to the
left of the lines. Circle every error (those that are not 0s). How many errors are
there in the data? ____. This time, what does an "error" mean?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

In applied linguistics, some researchers set up an implicational table (including
the lines) and stop there. For them a visual pattern is evidence enough that a
scale exists. If you do this, don't be surprised if Grandmother Borgese stands up
to say "I don't believe it!" So let's use a statistical procedure to give extra weight
to the evidence.

When you run the Guttman procedure on the computer, it will produce several
statistics for you. These include the coefficient of reproducibility, the minimum
marginal reproducibility, and the percent improvement. These are steps toward
determining the coefficient of scalability, the figure you want. Let's explain each
of these steps in order to determine the final scalability coefficient. We'll do this
first with the data in the chart on page 207 and then you can repeat the process
using the pilot study data in the practice section.

Step 1: Coefficient of Reproducibility

The coefficient of reproducibility tells us how easily we can predict an S's
performance from that person's position or rank in the matrix. The formula, should
you wish to compute this yourself, is:

$$C_{rep} = 1 - \frac{\text{number of errors}}{(\text{number of } Ss)(\text{number of items})}$$

In the morpheme example (page 207), we want to know whether we can claim
that the items form a true scale. There are 6 Ss doing 6 items with a total of
2 errors.

$$C_{rep} = 1 - \frac{2}{(6)(6)} \approx .945$$

This means that almost 95% of the time we could accurately predict which 
questions a person answered "yes" or correctly by his or her rank in the matrix. 
By convention, mathematicians have determined that the value of the coefficient 
of reproducibility should be over .90 before the scale can be considered "valid." 
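
As a quick check on the arithmetic, the formula can be written as a small Python function. This is a sketch only; the function and parameter names are chosen here for illustration.

```python
def coefficient_of_reproducibility(n_errors, n_subjects, n_items):
    """C_rep = 1 - errors / (number of Ss x number of items)."""
    return 1 - n_errors / (n_subjects * n_items)


# Six Ss, six items, two errors (the example matrix).
c_rep = coefficient_of_reproducibility(n_errors=2, n_subjects=6, n_items=6)
print(round(c_rep, 3))  # 0.944
print(c_rep > 0.90)     # True
```

The unrounded value is 34/36, or about .944; the text works with .945 in the later steps.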
Step 2: Minimum Marginal Reproducibility

The minimum marginal reproducibility figure tells us how well we could predict
if we did not consider the errors (the places where people behave in ways not
predicted by the model). The formula for this is:

$$MM_{rep} = \frac{\text{maximum marginals}}{(\text{number of } Ss)(\text{number of items})}$$

While the previous formula took into account all the errors, this one doesn't. Our
ability to predict Ss' performance accurately will be greater when we pay
attention to error than when we don't. This formula doesn't pay attention to error.

The only part of the formula you might not be able to decipher is the "maximum
marginals." This refers to the totals at the bottom of each column. For each
column select the larger value, whether it is the total for 0s or for 1s. Add these
to obtain the value for "maximum marginals." In the example, we sum the
number of 0s and 1s for each question as shown. For each column, we take the
larger number and sum these.

max marg = 5 + 4 + 4 + 4 + 5 + 6
max marg = 28

We can now insert this into our formula for minimum marginal reproducibility:

$$MM_{rep} = \frac{28}{(6)(6)} = .7778$$


Our answer for these data should be less than .945 (the value of the coefficient
of reproducibility for these data). That is, if we don't consider the errors, we
won't be able to reproduce as well an S's performance based on the S's position in
the matrix.
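
The "maximum marginals" step is easy to get wrong by hand, so a short Python sketch may help. The function name and the matrix layout (rows are Ss, columns are items) are assumptions made for this illustration.

```python
def minimum_marginal_reproducibility(matrix):
    """MM_rep = (sum of the larger marginal in each column) / (Ss x items)."""
    n_subjects = len(matrix)
    n_items = len(matrix[0])
    max_marginals = 0
    for column in zip(*matrix):          # walk the matrix column by column
        ones = sum(column)
        zeros = n_subjects - ones
        max_marginals += max(zeros, ones)
    return max_marginals / (n_subjects * n_items)


# The six-S example matrix (columns Q6 ... Q1).
example = [
    [1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
]
mm_rep = minimum_marginal_reproducibility(example)
print(round(mm_rep, 4))  # 0.7778
```

The column maxima are 5, 4, 4, 4, 5, and 6, which sum to 28 as in the worked example.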

Step 3: Percent Improvement in Reproducibility

Percent improvement in reproducibility just shows how much improvement there
is between the coefficient of reproducibility and the minimum marginal
reproducibility.




$$\%\ \text{improvement} = C_{rep} - MM_{rep}$$

$$\%\ \text{improvement} = .945 - .7778 = .1672$$

Step 4: Coefficient of Scalability

The coefficient of scalability is the figure that indicates whether a given set of
features are truly scalable (and unidimensional). It is this figure that is usually
reported in studies that use Implicational scaling. It is equal to the percent
improvement divided by 1 minus the minimum marginal reproducibility.

$$C_{scal} = \frac{\%\ \text{improvement in reproducibility}}{1 - MM_{rep}}$$

For our example, the coefficient of scalability is:

$$C_{scal} = \frac{.1672}{1 - .7778} \approx .752$$
If we reported the findings, we could now claim that these data are scalable.
While the scalogram does not show us a perfectly scaled data set, a strong pattern
of scalability is shown. Our coefficient of scalability is high, indicating a true
scale in the data. Statisticians have determined that the coefficient of scalability
must be above .60 before we claim scalability.
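
Steps 3 and 4 can be combined in one small function. Again this is a sketch with names chosen for illustration; it takes the two reproducibility figures as inputs rather than recomputing them.

```python
def coefficient_of_scalability(c_rep, mm_rep):
    """C_scal = (C_rep - MM_rep) / (1 - MM_rep)."""
    improvement = c_rep - mm_rep        # Step 3: percent improvement
    return improvement / (1 - mm_rep)   # Step 4: scale by the room left to improve


# Using the rounded figures from the worked example.
c_scal = coefficient_of_scalability(0.945, 0.7778)
print(round(c_scal, 3))  # 0.752
print(c_scal > 0.60)     # True
```

With the unrounded fractions 34/36 and 28/36 the result is exactly .75, so the small difference comes only from rounding in the earlier steps.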

While it is not crucial that you understand the formulas given above for
determining scalability, it is important that you understand that the Guttman
procedure is testing the distribution of 1s and 0s for a series of variables to see if they
can be ordered for difficulty. By preparing a matrix, you can locate individual
Ss who may have achievement patterns that differ from other students. For
example, it will locate Ss who have not acquired forms that their general level of
achievement would suggest they had attained. It will also locate Ss who have
acquired forms that, from their general level of achievement, you would not
predict. This is useful information in language testing as well as in interlanguage
analysis. And the coefficient of scalability figure should help convince you and
your audience that the claims you make regarding the scalability of your data are
correct. The procedure is descriptive; that is, the scale belongs to the particular
data set from which it was drawn. Before we can generalize from these findings,
further replication studies would be needed.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 
Practice 7.7

► 1. Let's repeat the procedure now using the pilot data collected to replicate
Scarcella's study.


a. There are 19 Ss and 6 items with 18 errors in the data set. Calculate the
coefficient of reproducibility.

$$C_{rep} = 1 - \frac{\text{number of errors}}{(\text{number of } Ss)(\text{number of items})} = \_\_\_\_$$

At this point, does it appear that the data are scalable? How much better than
fifty-fifty chance have you of predicting a person's performance given his or her
place in the matrix?

b. Calculate the minimum marginal reproducibility.

$$MM_{rep} = \frac{\text{maximum marginals}}{(\text{number of } Ss)(\text{number of items})} = \_\_\_\_$$

The value for $MM_{rep}$ should be smaller than that for $C_{rep}$. If it is not, check the
answer key.

c. The percent improvement in reproducibility is found by subtracting $MM_{rep}$
from $C_{rep}$. The percent improvement for this example is ____.

d. Now, the final step is to apply the coefficient of scalability formula:

$$C_{scal} = \frac{\%\ \text{improvement in reproducibility}}{1 - MM_{rep}} = \_\_\_\_$$

e. The data do not show a clear scale. The $C_{scal}$ is not large enough that we can
feel confident in saying that a scale exists. Since the results do not support
Scarcella's hypothesis, what explanations can you offer as to why this might be
the case? ____

f. If the data had shown a scale, could you claim that there is a natural order
for second language learner performance on these communication factors? Why
(not)? ____


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Implicational scaling is easily done using the Guttman procedure in many
computer software packages. It is tedious to carry the procedure out by hand. The
calculations are simple, but arranging the data in a matrix can be confusing. In
review, the procedure for preparing the matrix is:


1. Compute the number of 1s for each item.

2. Rank-order the items from difficult to easy across the top of the matrix on
this basis.

3. Compute the number of 1s obtained by each S.

4. Rank-order the Ss in the matrix with the best student at the top and lowest
at the bottom.

Once the data are placed in the matrix, the line marking the division between the
0s and 1s is drawn in. To do this, group the Ss who have the same number
correct (even though the items correct may differ).

1. Find the number of Ss who have only one 1 response. Draw a line one
column in from the right edge for these Ss.

2. Extend the line vertically until you find the first S with two 1 responses.
Move the line in another column at this point (two columns in from the right
edge for all Ss with two correct responses).

3. Extend the line vertically until you find the first S with three 1 responses.
Move the line in another column.

4. Continue this process until the line is complete. If at any point there is no S
with the number correct, go to the next number. For example, if there were
Ss who had 3 correct, Ss who had 5 correct, but none with 4 correct, the line
moves from 3 columns to 5 columns. That is, the number of items correct
always equals the number of columns from the right for location of the line.

The next step is to locate the errors.

1. Look to the right of the line and find any 0s. Circle these as errors where the
S missed an item he or she was expected to get correct.

2. Look to the left of the line and find any 1s. Circle these as errors where the
S got an item we predict he or she would miss.

3. Total the number of errors.

In this final step, we need to determine the "marginals" for the data. This is the
total for 0s and 1s at the bottom of each column. Add the larger number
(whether 0 or 1) at the bottom of each column to find the maximum marginal
value for the formula.


The Guttman procedure can be done by hand, but unless the number of items in
the study is very small and the number of Ss is also small, it takes a great deal
of time to sort all the data. It is much simpler to let the computer do the work.
There are, however, several cautions to keep in mind before you let the computer
help you. The computer can't make a number of important decisions for you.


Problems with Implicational Scaling

Implicational scaling is a useful technique for the study of language acquisition.
It is also a valued technique in dialect studies. For example, sociolinguists have
used the technique to scale phonological features of certain American dialects.
With the scale, it is possible to show that if a speaker uses the less frequent
(higher on the scale) features of the dialect, he or she will also use those lower on
the scale. Implicational scaling has, thus, been used to show orderliness in
acquisition and orderliness in language variation data.

The major problem in dealing with variable data, though, is deciding when a
form is acquired (thus, a 1) and when it is not (a 0). From your teaching and/or
your own research, you know that learners use some forms correctly all the time,
use others incorrectly all the time, but use a very large number of forms with
varying degrees of consistency. For example, if you had collected free speech data
from a learner, you might want to know whether or not the person uses the three
forms of -s agreement for the third person singular present tense (e.g., "He
laughs, dances, and plays the guitar"). You could go through the data and mark
each place where the form is required (a.k.a. an obligatory instance). Then you
could tally the proportion of times the -s was used (a.k.a. percent supplied in
obligatory instances). Once this is done, you must decide on a cutoff point. If
your criterion for acquisition is 80%, you check the percentage and, finding it is
84%, you enter a 1.

Happy about this, you decide to look at data from another student. This time,
when you tally the percent correct, it turns out to be 79%, and you must enter a
0. This must give you pause for thought. Is a learner who supplies the ending
in 84% of the cases really that much better than someone who supplies the
ending in 79%? Is everyone over 80% an acquirer and everyone under 80% a
nonacquirer? Obviously, the researcher must justify decisions made in this regard.
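
The coding decision described here can be made concrete with a short Python sketch. The function name, the counts (84 and 79 suppliances out of 100 obligatory instances), and the choice to treat a score exactly at the cutoff as acquired are all assumptions made for illustration.

```python
def code_acquisition(supplied, obligatory, cutoff=0.80):
    """Return 1 if the proportion supplied in obligatory instances
    meets the cutoff, otherwise 0. Treating "exactly at the cutoff"
    as acquired is an assumption of this sketch."""
    return 1 if supplied / obligatory >= cutoff else 0


print(code_acquisition(84, 100))               # 1
print(code_acquisition(79, 100))               # 0
print(code_acquisition(79, 100, cutoff=0.60))  # 1
```

The third call shows how the same learner moves from 0 to 1 when only the criterion changes, which is why the cutoff has to be justified.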

A second problem has to do with the question of what to do with missing data.
When gathering observational data in natural settings, learners may simply not
use the forms you wish to scale. Perhaps they use only a few of the forms. How
many instances must occur in the data? What if the person talked about past
experiences and there were only three places where an -s for present tense would
have been appropriate? If the -s was supplied twice, the S obtains a 66% (a 0).
With another topic, more -s forms might have been required and more supplied.
The researcher must decide how many potential uses are needed. While
convention requires five instances and an 80% cutoff point, there is no well-documented
rationale for either of these conventions. If fewer than five instances occur, then
a "missing data" symbol is placed in the matrix. (Imagine that a learner used only
four instances of a particular morpheme but got all four of them. This might,
indeed, change the scalability of the morphemes.) Again, the researcher must
give the reader information on what grounds decisions were made to code missing
data.
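
A minimal Python sketch of the missing-data convention follows. The function name is hypothetical, and using `None` as the "missing data" symbol is a choice made here; the five-instance minimum and 80% cutoff are the conventions mentioned in the text.

```python
def code_morpheme(supplied, obligatory, cutoff=0.80, min_instances=5):
    """Code one morpheme for one S as 1, 0, or None (missing data).

    None is returned when there are too few obligatory instances
    to make a decision under the chosen convention.
    """
    if obligatory < min_instances:
        return None
    return 1 if supplied / obligatory >= cutoff else 0


print(code_morpheme(2, 3))                   # None (only three obligatory instances)
print(code_morpheme(4, 4))                   # None (four out of four, still too few)
print(code_morpheme(4, 4, min_instances=4))  # 1
print(code_morpheme(2, 3, min_instances=3))  # 0
```

Relaxing `min_instances` turns the four-out-of-four learner from missing data into a 1, which is the kind of change that can alter the scalability of the morphemes.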
Finally, it is possible that reversals might occur in the acquisition of certain
morphemes. For example, it's possible that beginning learners might obtain high
scores for irregular past tense forms (e.g., "he sang," "I ran"). Later, at the
intermediate stage, the same learners might get low scores using the more regular
rule for past tense ("he singed," "I runned"). This is sometimes called a U-shaped
learning curve. Such a learning pattern would definitely influence the scalability
of the acquisition data.

In spite of these problems, the Guttman procedure is useful when we want to
know whether there is a scale (an order of difficulty for items in a particular data
set). If you do the procedure by hand, you will find it very tedious and, when a
procedure is tedious, mistakes just seem to happen on their own. This is one of
several procedures for which we urge you to let the computer do the work. Check
the printout for the statistics you need. Since the computer usually excludes
missing data from the analysis, check to see how many missing values there were.
Consider all the other warnings we have given you regarding the Guttman
procedure. That is, don't let the computer think or interpret for you, but do let it
do the donkey work.


Other Applications of Distribution Measures 

In this chapter we have looked at ways in which we can locate individual scores
in a distribution, compare locations of scores in data where the units of
measurement differ, and determine whether or not we can predict student
performance on a series of features (that is, whether the features will scale for
difficulty so that we can predict student performance on the basis of place in a
matrix).

The examples we have given in most cases have been taken from general applied
linguistics research rather than from language testing. Yet, we can see that each
of these issues is important in language testing and program administration.
Administrators of language programs must understand these ways of locating
scores in distributions in order to make decisions about who should or should not
be admitted to their programs. They need to understand them in order to assign
students to the courses that will benefit them most. When test scores come from
different testing instruments, it is crucial that they understand standardized
scores. Without this knowledge, decisions may be made that do a disservice to
students.

Test designers, as well as researchers of interlanguage, use the Guttman procedure
to determine the difficulty order of items. Students who have high achievement
scores should not make errors on items that were designed for beginning students.
Nor should beginning students do well on items thought to be among the most
difficult. Scalability of test items is important here too. In addition, test
developers are often curious to know whether there is one underlying dimension to a
total test or whether different parts of tests are tapping other language
components. If there is one underlying dimension to a test, the Guttman procedure
should show this unidimension with a high coefficient of scalability. (However,
lack of a high coefficient of scalability can be due to "error," so this is not a good
argument for multidimensionality in tests!)




oooooooooooooooooooooooooooooooooooo 

Practice 7.8


Review the questions in your research journal. For which would Implicational
scaling be an appropriate statistical procedure? Explain how you would
overcome the problems that might be related to using the procedure or interpreting
the output of the procedure for your project(s).



OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


In this chapter we have looked, then, at issues related to locating values within a
data distribution. We have recycled the notion of normal distributions. These
and other issues will reappear in different guises in the following chapters.


Activities 

1. M. A. Snow (1985. A descriptive comparison of three Spanish immersion
programs. Unpublished doctoral dissertation. Applied Linguistics, UCLA.)
includes the following table (among many others) in her study. Site refers to the
locations of the three different programs.

MLA-Coop. Spanish Test Scores - Grade 6 Mean and Percentile Scores

           Listening              Reading               Writing
 n    score    sd   %ile    score    sd   %ile    score    sd   %ile
17    31.71   8.8   81.4     27.6   7.2   68.6     50.4  20.1   59.1
28    38.90   4.4   94.7     36.4   7.3   87.2     66.4  18.0   76.6
14    32.20   7.4   84.1     28.1   6.8   77.0     44.9  13.9   52.1


At which site does it appear the Grade 6 children score highest? At which site
do the Ss appear to be the most homogeneous in performance? At which site do
the Ss vary the most in performance? Look at the percentile figures. These
percentiles are based on norms listed by the test publisher. Since the test is widely
used, you can be sure that these represent a normal distribution. Note that the
percentiles here are not for individual scores but rather for the group mean score.
Select one school site and think about the variability of the distribution for each
skill tested. What would you estimate (yes, guess) that the range of individual
percentiles might look like for children in this class on each skill?

2. K. Hyltenstam (1977. Implicational patterns in interlanguage syntax
variation. Language Learning, 27, 383-411.) looked at the order in which Ss from
differing first languages acquired elements of the Swedish auxiliary system.
Review the implicational tables given in the study. If this were your study, what
changes would you make (and why)?

3. R. Andersen (1978. An implicational model for second language research.
Language Learning, 28, 2, 221-282.) criticizes morpheme acquisition studies,
regrouping the morphemes studied, and showing how the cutoff point affects the
results. The data are from compositions written in English by Spanish speakers.
What advantages might compositions have over oral language data when we wish
to discover order in the acquisition process? What disadvantages do you see?

4. A number of researchers have listed the most frequent types of morphology
errors of second language learners. These studies are usually based on proficiency
examinations. What are the difficulties that you see in using proficiency exam
errors to construct an implicational scale for ESL learners? What advantages
might the Bilingual Syntax Measure or the SLOPE test have over a general
proficiency exam for this purpose?

5. Data for such a study are given in the following table. Conversational
language data were gathered from 33 Ss in natural settings. The data were
tape-recorded and transcribed. Each S received a 0 or 1 to show whether each of these
individual morphemes was acquired.

Morpheme Matrix*

 S    PaR   Hv   3rd   Vn   PaI   AUX   ING   COP
25     1    1     1    1     1     1     1     1
27     0    1     1    1     1     1     1     1
28     0    0     0    1     1     1     1     1
20     0    -     0    1     1     1     1     1
21     1    -     0    0     1     1     1     1
26     0    0     0    0     1     1     1     1
29     0    -     0    -     1     1     1     1
30     0    -     0    0     1     1     1     1
14     0    0     1    0     0     1     1     1
 6     0    1     0    1     0     1     0     1
22     0    0     0    0     0     1     1     1
32     0    -     0    -     0     1     1     1
24     0    -     0    -     0     1     1     1
14     0    0     0    0     0     1     1     1
15     0    -     0    -     1     0     1     1
 1     0    -     0    -     0     1     0     1
 7     0    0     0    0     0     0     1     1
23     0    0     0    0     1     -     0     1
 3     0    0     0    1     0     0     1     0
11     0    0     0    1     0     0     1     0
12     0    0     -    0     0     0     1     1
10     0    -     0    -     0     1     1     0
19     0    0     0    0     0     0     0     1
31     0    0     0    0     0     0     0     1
33     -    -     0    -     0     0     0     1
18     0    0     0    0     0     -           1
 5     0    -     0    -     0           1     0
 8     0    -     0    -     0           1     0
16     0    0     0    0     -     0     0     1
 2     0    0     0    0     0     0     0     0
 9     0    0     0    0     0     0     0     0
19     0    0     0    0     0     0     0     0
17     0    0     0    0     0     0     0     0

*COP = copula (be), ING = continuous, AUX = be as an auxiliary, PaI = irregular
past tense, Vn = past participle, 3rd = third person singular present tense,
Hv = present participle, PaR = regular past tense.

Draw in the line and determine the total number of "errors." Do a Guttman
analysis of the data. What questions do you have regarding the results? If this
were your study, how might you redesign it to answer the questions you have
raised?

The cutoff point for the assignment of 1 vs. 0 in this table was 60% accuracy.
Since we had access to a computer program, we reran the analysis using a cutoff
point of 80%. While the morphemes appeared to be nicely scaled for difficulty
when 60% was the cutoff point, that scaling was not so convincing at the 80%
cutoff point. In fact, the order of difficulty for the morphemes at the 80% cutoff
point is different from that obtained at the 60% cutoff point. If the claim of
scalability can be made only in terms of the criterion level, what problems do you
see for claims of a "natural order of difficulty" for these morphemes? How might
the researcher determine which cutoff point is appropriate?
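
To see how a cutoff can change the apparent order of difficulty, consider the following Python sketch. The accuracy figures are entirely hypothetical (they are not the data from the table above), and the function name is chosen here for illustration.

```python
# Hypothetical proportions correct for three Ss on three morphemes.
accuracy = {
    "A": {"COP": 0.95, "ING": 0.85, "AUX": 0.65},
    "B": {"COP": 0.70, "ING": 0.85, "AUX": 0.55},
    "C": {"COP": 0.65, "ING": 0.50, "AUX": 0.40},
}


def difficulty_order(accuracy, cutoff):
    """Order morphemes from most difficult (fewest Ss at or above the
    cutoff) to easiest, and return the counts as well."""
    morphemes = list(next(iter(accuracy.values())))
    acquired = {
        m: sum(1 for scores in accuracy.values() if scores[m] >= cutoff)
        for m in morphemes
    }
    return sorted(morphemes, key=lambda m: acquired[m]), acquired


print(difficulty_order(accuracy, 0.60)[0])  # ['AUX', 'ING', 'COP']
print(difficulty_order(accuracy, 0.80)[0])  # ['AUX', 'COP', 'ING']
```

In this made-up data set COP looks easiest at the 60% criterion but ING looks easiest at 80%, even though the learners' performance has not changed.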

6. R. Scarcella (1983. Developmental trends in the acquisition of conversational
competence by adult second language learners. In N. Wolfson and E. Judd
[Eds.], Sociolinguistics and Language Acquisition. New York, NY: Newbury
House.) hypothesized, on the basis of her analysis of conversational interviews,
that there may be an "order of acquisition" for conversational competence. For
example, she found that openings and closings were better performed than
preclosing moves. She also found that "requests for clarification" were less well
managed than "repairs." This is the study for which we presented pilot
replication data in this chapter. Imagine that you had the list of universals of
communication proposed by Goffman (which contains communication signals beyond
those tested by Scarcella) and wish to test the notion of a natural order of
acquisition of these universals by second language learners. How might you design
the project and how would you carry out the Guttman analysis (cutoff points,
etc.)?

7. J. Amastae (1978. The acquisition of English vowels. Papers in Linguistics,
11, 3-4, 423-458.) investigated the order of acquisition of English vowels by nine
fluent Spanish-English bilinguals. In the following chart, a 2 represents standard
form and a 1 represents fluctuation between standard and other forms. Notice
that the table is not set up in the same way as our examples. Items with the most
correct responses are on the right and the student with the most items correct is
at the bottom of this chart. You will need to reverse directions in locating the
line.


Implicational Scale for Vowels

iV    /E/   /U/   /ae/   <>
 1     1     1     1      1
 2     1     1     2      2
 1     1     1     2      2
 1     1     2     2      2
 1     2     2     2      2
 1     2     2     2      2
 1     2     2     2      2
 1     2     2     2      1
 1     2     2     2      2


First, draw the line. Then, circle the responses that are "errors" (i.e., where Ss 
either use variable performance when they should have used standard or used 
standard when variable performance would be expected). Do a Guttman analysis 
of the data. What questions do you have regarding the procedure and the results? 
If this were your study, how might you redesign it and answer the questions you 
have raised? 

8. P. Hopper & S. Thompson (1980. Transitivity in grammar and discourse.
Language, 56, 2, 251-299.) listed 10 features they believe determine the degree of
transitivity of clauses in discourse. That is, a clause such as "I was drinking some
of the punch" would have weak transitivity and a clause such as "I drank up the
punch" would have strong transitivity. Strength of transitivity is determined by
these 10 factors. The question is whether each factor weighs the same or whether
we can determine a scale for the 10 factors. Can we predict which factors high
transitive clauses will exhibit and which factors even low transitive clauses have?
Do the factors exhibit a scale?

The clauses in the following data come from a casual telephone conversation
between a brother and sister. In the chart, each clause has been coded as + or -
on each of the 10 factors. The utterances are arranged so that the example with
the largest number of factors is at the top and the utterance with the fewest
factors is at the bottom. The features themselves are arranged from high
transitivity to low transitivity across the top.


                                                     Features*
Utterances                                   1  2  3  4  5  6  7  8  9  10
10 I called him                              +  +  +  +  +  +  +  +  +  +
17 I defended you                            +  +  -  +  +  +  +  +  +  +
 8 I washed some sheets                      +  +  +  +  +  +  -  +  +  +
 9 I'm also washing some sheets              -  -  +  +  +  +  +  +  +  +
15 I've gotta call her                       -  -  +  +     +  +  +  +  +
16 We can call her                           -  -  +  +  -  +  +  +  +  +
 1 I was preparing for you tomorrow          -  -  -  +  +  +  +  +  +  +
12 He finally got one                        -  +  +  -     +  -  +  +  +
 2 I was thinking about it                   -  -  -  -  -  +  +  +  +  +
14 Can you imagine Mother and Dad
   giving a Halloween party?                 -  -  -  -  -  +  +  +  +  +
18 I'm gonna leave the door open             -  -  -  -  -  +  +  +  +  +
21 See you this evening                      -  -  -  -  -  +  +  +  +  +
 4 You have your car                         -  -  -  -  -  -  +  +  +  +
 6 You want me to tell Scott you're coming?  -  -  -  -  -  -  +  +  +  +
13 They're gonna give a Hallowe'en           -  -  -     -  +     +  +  +
 3 Which will give me the week to            -  -  -     -  -  +  +  +  +
11 He's got an answering machine too         -  -  -     +  -     +  +  +
19 I can't believe it                        -  -  -     +  +  -  -  +  +
20 You don't know me very well               -  -  -     +     +  -  +  +
 5 That will make it quite easy              -  -  -              +  +  +
 7 If you don't want me to come              -  -              +     +  +

*The features are: 

1. Affectedness of object: how completely the object is affected ("I drank up the 
milk" vs. "I drank some of the milk") 

2. Aspect: action is completed ("I ate it" vs. "I am eating it") 

3. Punctuality: verb has no "phase" between onset and completion ("kick" vs. 
"carry") 

4. Kinesis: action verbs vs. state verbs ("hug" vs. "like") 

5. Mode: realis vs. irrealis 

6. Volitionality: agent acting purposefully ("I wrote" vs. "I forgot your name") 

7. Individuation of object: human, animate, definite, individuated ("Fritz drank 
the beer" vs. "Fritz drank some of the beer") 

8. Affirmation: positive vs. negative clause 

9. Participants: number of arguments of verbs (Note: matrix only contains 
two-participant clauses in the conversation as defined for this particular 
study) 

10. Agency: degree of willing/ableness of agent ("George startled me" vs. "The 
table startled me") 

Draw in the line and determine the total number of "errors." Do a Guttman 
analysis of the data. Are the data scalable? The data come from a section of one 
conversation. If you wished to establish scalability for Thompson & Hopper's 
factors in oral language, what other kinds of data might you want to include? 
Justify your choice. How might you be sure that the coding (the assignment of 
+ or -) is correct in the data prior to doing the Guttman procedure? 

Chapter 8 

Probability and Hypothesis 
Testing Procedures 


• Probability 
    Probability of individual scores in a distribution 
    The normal distribution (z score) table 
    Probability of group scores or frequencies in a distribution 
• Steps in hypothesis testing 
• Distribution and choice of statistical procedures 
    Choices in data analysis 
    Parametric vs. nonparametric procedures 

In previous chapters, we have shown ways of describing data so that information 
on outcomes can be seen at a glance. In most research, however, we want to do 
more than describe. We want to know whether the data we have described can 
be used as evidence in support of our hypotheses. 

If, in describing data, we see differences in frequencies or in scores, we want to 
know how confident we can be about any claims we want to make. Are the 
differences large enough? Might apparent differences just be normal variation in 
human performance? Are the differences real? Statistical tests have been created 
precisely to answer these questions. A statistical test checks the probability of any 
outcome and thus tells us whether we can have confidence in our claims. In some 
cases these claims are descriptive; the statistical test gives us confidence that the 
descriptions of the data are correct. The statistical test is used for descriptive 
purposes. In other cases, we hope to generalize, to make inferences from our data 
to other learners or other cases. This involves inferential statistics. Statistical 
tests give us confidence that our descriptions and/or inferences are correct. 

The purpose of statistical tests, then, is to give us (as researchers or as readers of 
research) confidence in claims about the data. They do this by estimating the 
probability that the claims are wrong. When researchers report the statistical 
significance of their results, it means that they have applied a statistical test to the 
data and that this test has, via probability, told them how much confidence to 
place in their claims. 

Probability 


Probability is not an easy concept to define, but we all have a notion of what is 
probable and what is not. The weather reporter on the local TV channel tells us 
the probability of rain; our Lotto tickets tell us the probability of winning "The 
Big Spin" (about the same probability as being struck by lightning on the 
expressway). We are all surprised, and often delighted, when something improbable 
happens (not being struck by lightning, of course). We don't expect to win a 
million dollars, a trip for two to Tahiti, or even the local raffle of a chocolate 
cake. But there is always a chance. 

In research, when we offer a hypothesis, we are making an educated guess about 
what is or is not probable. Similarly, the weatherman who says that the 
probability of rain is high isn't making a wild prediction. The prediction is based on 
previous information about rain at this time of the year given prevailing 
conditions, information gathered over many years. The prediction is often reported in 
terms of "chance" (e.g., that there is an 80% chance of showers occurring in the 
morning hours). This means that there are 20 chances in 100 that the weather 
forecaster may be wrong in claiming rain will occur. These odds relate to 
probability. 

When we look at the frequencies, ratings, or scores that we have obtained in our 
research, we want to be sure that they truly support our hypothesis before we 
predict rain! And we want there to be fewer than 20 chances in 100 that we are 
wrong. If we have stated our hypothesis in the null form, then we want to feel 
confident in rejecting it. 

When we reject the null hypothesis, we want the probability to be very low that we 
are wrong. If, on the other hand, we must accept the null hypothesis, we still want 
the probability to be very low that we are wrong in doing so. 

In many statistics books, writers talk about "Type 1" and "Type 2" errors. A 
Type 1 error is made when the researcher rejects the null hypothesis when it 
should not have been rejected. For example, the person said his group of Ss was 
better than a control group when, in fact, it probably wasn't. A Type 2 error, 
on the other hand, occurs when the null hypothesis is accepted when it should 
have been rejected. That is, the person said his group wasn't any better when, in 
fact, it was. Statistical tests help us to avoid these errors. 
They give us confidence in our claims by telling us the probability that these 
particular results would obtain based not on simple odds but on statistical 
probability. 

In research, we test our hypotheses by finding the probability of our results. 
Probability is the proportion of times that any particular outcome would happen 
if the research were repeated an infinite number of times. The way we are able 
to do this relates to our earlier discussions of the normal distribution. For any 
human activity, if we continue to collect data from more and more subjects or 
events, we will approach a normal distribution. It is possible then to locate the 
data of one S or event in the distribution and talk about how typical (how 
probable) it is in the total normal distribution. In addition, we will be able to locate 
data of groups in distributions and find how typical or probable they are in a similar 
way. 

Probability of Individual Scores in a Distribution 

In order to understand probability better, let's review what we already know 
about locating individual scores in a distribution. If we have the X̄ and the s.d. 
of a distribution, we can locate any individual score in the distribution by using 
z scores. And we can find the probability of obtaining that particular z score in 
the distribution. 

Imagine that you administered a general language proficiency examination to a 
group of 100 students at your school. The test publishers have told you that the 
X̄ for the test is 75 and the standard deviation is 10. You have corrected the 
exams of your students. What is the probability that the very first paper you pick 
up will have a score above 75? The answer, if you remember the characteristics 
of the normal distribution, is easy. Remember that in the normal distribution, 
the mode, the median, and the mean are the same. The mean is the balance point 
on the seesaw. Half of the scores fall below the mean and half above it. The 
probability that the first paper will be above 75 is .50. The probability that it 
will be below 75 is .50. 

Remember that the characteristics of the normal distribution tell us that 34% of 
the scores will fall between the mean and 1 standard deviation. 



In this case, the probability that the next paper you pick up will have a score 
below 85 is .84. Half (50%) of the scores are below the mean and 34% (more 
precisely, 34.13%) of the scores are between the mean and +1 standard 
deviation. 

The probability that the next paper will have a score lower than 95 is 50% from 
below the mean + 34% (actually 34.13%) between the mean and 1 s.d. + 13% 
(more precisely, 13.59%) between 1 s.d. and 2 s.d. = 97.72% or 98%. The 
probability that it will be higher than a score of 95 is .02 (2.28% or 2 chances in 
100). 

For those Ss who didn't do so well, the probability that the first paper you choose 
will have a score lower than 55 is .02 (again, the chances are very low: 2 chances 
in 100). 

When the probability level is .05, this means that there are 5 chances in 100 of 
obtaining this score given a normal distribution. There is only 1 chance in 1,000 
that an S would obtain a score of 105 on this test. When the score is extremely 
high or extremely low, the probability level of such a score is low. There are very 
few chances of that score occurring. 
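
The arithmetic in the last few paragraphs can be checked by computer. What follows is only a minimal sketch in Python, assuming the test norms quoted above (a mean of 75 and a standard deviation of 10). The helper names p_below and p_above are our own inventions for illustration; NormalDist comes from Python's standard statistics module. 

```python
from statistics import NormalDist

# Test norms quoted in the text: mean 75, standard deviation 10.
scores = NormalDist(mu=75, sigma=10)

def p_below(score):
    """Proportion of the normal distribution that falls below `score`."""
    return scores.cdf(score)

def p_above(score):
    """Proportion of the normal distribution that falls above `score`."""
    return 1 - scores.cdf(score)

for cutoff in (75, 85, 95, 55):
    print(f"P(score < {cutoff}) = {p_below(cutoff):.4f}")
for cutoff in (95, 105):
    print(f"P(score > {cutoff}) = {p_above(cutoff):.4f}")
```

The printed proportions should agree, after rounding, with the .50, .84, .98, and .02 values worked out by hand, and the value for 105 should be close to 1 chance in 1,000. 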

When individuals obtain scores that have very low probability (either much 
higher than expected or much lower than expected), those scores fall under the 
tails of the distribution. Scores in these areas are scores that, because they are 
so improbable, need to be thought about carefully. Perhaps they are people who 
have certain characteristics, study especially hard, have received training of a 
special type, and so forth. Somehow these extreme scores "don't belong" in this 
distribution even though they are part of the distribution. The reason such scores 
"don't belong" may have to do with some independent variable we want to test. 


ooooooooooooooooooooooooooooooooooooo 

Practice 8.1 

► 1. Assume that you gave a listening test to 60 Ss. The mean for the group was 
25 and the standard deviation was 5. 

a. What is the probability of scoring below the mean? _____. What is the 
probability of scoring below 20? _____. What is the probability of scoring 
above 30? _____. 

b. Assume 15 is the "passing" cutoff point. What is the probability of not passing 
the test (getting a score below 15)? _____. 

2. Assume 35 is the lower boundary for an "A" on the test. Draw a frequency 
polygon for this distribution. Label the mean and standard deviation points. 
Shade the area for scores below the "A" point. What proportion of scores fall 
in this area? What is the probability of getting an "A" on the test? _____. 



ooooooooooooooooooooooooooooooooooooo 

The Normal Distribution (z Score) Table 

Obviously, all scores in a distribution do not fall precisely at the X̄, 1 s.d., 2 s.d., 
and so forth. Mathematicians, working with the concept of normal distribution, 
have given us tables that tell us precisely how probable any individual z score is 
for any test. You will find the table for the z score distribution in appendix C 
(table 1). If you know your score on a test (and the mean and s.d. so that you 
can figure out your z score), you can discover the probability of your score by 
consulting this table. 

The first column of the table gives the z score obtained in the calculation. The 
second column is the proportion of scores we can expect to find between the z 
score and the X̄. The third column is the proportion of scores in the curve on the 
other side of the z score (i.e., the probability of that particular z score). To keep 
the page looking balanced, the z scores continue from the bottom of the first three 
columns to the next group of three, and then the next group of three. 

Notice the very last z score on the first page of the table. The z score is 1.64. 
The area beyond the z score of 1.64 is .05. This is the z score needed for an .05 
probability level. That is, there are only 5 chances in 100 of obtaining a z score 
this high in this distribution. The probability of a score between a z score of 1.64 
and the mean is .45 (from the second column in the table). There are 45 chances 
in 100 of getting a score between this z score and the X̄ and 50 chances in 100 
of getting a score below the X̄. So, there are 95 chances in 100 (a .95 probability) 
of getting a lower score, but only 5 in 100 of obtaining a score this high. 

Continue consulting the z score values until you find an "area beyond z" that 
equals .01. A z score of 2.33 shows us that the likelihood of such a score is 1 in 
100. The z score needed to give us a probability of 1 in 1,000 (i.e., a probability 
of .001) is 3.08. 
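
For readers without the table at hand, its second and third columns can be approximated by computer. The sketch below is an illustration only, not the procedure used to build the appendix table. The function names z_score, area_mean_to_z, and area_beyond_z are hypothetical labels of our own; NormalDist is part of Python's standard statistics module. 

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_score(score, mean, sd):
    """Locate a raw score in its distribution, in standard deviation units."""
    return (score - mean) / sd

def area_mean_to_z(z):
    """Second column of the table: area between the mean and z."""
    return abs(std_normal.cdf(z) - 0.5)

def area_beyond_z(z):
    """Third column of the table: area in the tail beyond z."""
    return 1 - std_normal.cdf(abs(z))

for z in (1.64, 2.33, 3.08):
    print(f"z = {z:.2f}   mean to z = {area_mean_to_z(z):.4f}   "
          f"beyond z = {area_beyond_z(z):.4f}")
```

For any z, the two areas add up to .50, which is half of the curve. The values printed for 1.64, 2.33, and 3.08 should round to the .05, .01, and .001 levels discussed above. 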

Thus, we can discover the probability of any score in a normal distribution if we 
compute the z score and consult the z score table. 

Practice 8.2 

► 1. Find the probability associated with the area between the mean and a z 
score of .34. _____. A z score of 2 has what area of the distribution beyond it? _____. 


► 2. Find the probability associated with z scores of 

1.28_ 

0.82_ 

3.00_ 

► 3. Find the proportion of scores lower than one with a z score of (remember 
the 50% below the mean!) 

0.84_ 

1.99_ 

1.48_ 

► 4. Find the z score associated with a probability of 

0.50_ 

0.03_ 

0.025_ 

ooooooooooooooooooooooooooooooooooooo 

Probability of Group Scores or Frequencies in a Distribution 

Now let's transfer what you know about the probability of individual scores in a 
distribution to the probability of group scores or frequencies in a distribution. 

Just as we expect that individuals will display behavior which is within the range 
of normal human performance, we expect that groups of individuals will display 
behavior which is within the range of normal performance for groups. The same 
principle applies. We expect that if we find the mean score for our group it will 
fall somewhere in the normal distribution of means for all groups. 

That is, if we give a test to many groups of students over and over again, we 
expect that the distribution of the means will approach a normal distribution. So, 
we can ask how probable it is that the data from our sample group of Ss fit into 
that normal distribution. If we have given our Ss some special new instructional 
program, we hope that our group score will be at the far right tail of the 
distribution polygon. The probability of that happening is not large if the null 
hypothesis is correct. If the probability of our group mean in the distribution of 
group means is .01, we should be delighted. The likelihood of our sample group 
of students scoring that high is 1 in 100. We have good cause to reject the null 
hypothesis. 

Probability, then, is related to hypothesis testing. When the probability level is 
very low, we can feel confident that our sample group of Ss differs from other 
groups who may have taken the test in the past or who might take it in the 
future; that is, the population. Our group of Ss forms a sample and we test 
whether the data from that sample "fit" with those of the population (all the other 
groups in the distribution). 
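
The idea that the means of many groups pile up in a roughly normal distribution can be tried out with a small simulation. The sketch below rests on assumptions of our own: an invented population whose scores are spread evenly from 0 to 100 (deliberately not normal), groups of 30 students, and 2,000 repetitions. The fixed seed only makes the run repeatable. 

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def group_mean(n_students):
    # Hypothetical population: scores spread evenly from 0 to 100,
    # which is deliberately NOT a normal distribution.
    return mean(rng.uniform(0, 100) for _ in range(n_students))

# Give the "test" to 2,000 groups of 30 students and keep each group mean.
group_means = [group_mean(30) for _ in range(2000)]

centre = mean(group_means)
spread = stdev(group_means)
within_1sd = sum(abs(m - centre) <= spread for m in group_means) / len(group_means)

print(f"mean of the group means:  {centre:.2f}")
print(f"s.d. of the group means:  {spread:.2f}")
print(f"proportion within 1 s.d.: {within_1sd:.2f}")
```

If the group means behave like a normal distribution, roughly 68% of them (34% on either side) should fall within 1 standard deviation of their own mean, even though the individual scores were not normally distributed. 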

You'll remember from earlier chapters that we state our hypotheses in the null 
form. Our aim is to reject the null hypothesis. The null hypothesis says there is 
no difference between the mean of our sample group and the mean of the 
population from which it was drawn. (This always seems strange since we hope that 
there is a difference.) We want to reject the null hypothesis and we use the 
probability of the location of our data in the normal distribution for this purpose. 
(Yes, in case you wondered, mathematicians have worked out special normal 
distributions for group means as well as for individual scores. They also have 
established probabilities for other special distribution curves where the median 
was selected as the measure of central tendency.) 

The next issue is just how improbable a finding must be before we feel confident 
about rejecting the null hypothesis. Usually we want the probability to be very 
low indeed. The practice in most applied linguistics research is not to reject the 
null hypothesis unless there are fewer than 5 chances in 100 (.05 probability level) 
of obtaining these results. (So that's what the p < .05 means in all those tables 
in articles in our journals!) The .05 probability tells us there are fewer than 5 
chances in 100 that we are wrong in rejecting the H0. If the probability level is 
set at p < .01, there is only 1 chance in 100 of error in rejecting the H0. If the 
probability level is set at p < .001, there is 1 chance in 1,000. We can have 
confidence in rejecting the null hypothesis. 

Just to be absolutely sure the concepts we have been discussing are clear, look at 
the following visuals. If the score for our sample group falls in the shaded area, 
we will not reject the null hypothesis. The score is typical of that of all groups in 
the population. 



If, however, the score falls in one of the shaded areas in the following drawing, 
we will reject the null hypothesis. Our score is not highly probable in this 
distribution. (There are 5 chances in 100 of being wrong when we reject the H0.) 

You will notice that although we set .05 as our probability level (level of 
significance), we have had to divide the .05 in two, giving an .025 area to each tail of 
the distribution. Since we specified no direction for the null hypothesis (i.e., 
whether our score will be higher or lower than more typical scores), we must 
consider both tails of the distribution. This is called a two-tailed hypothesis. 
When we reject the null hypothesis using a two-tailed test, we have evidence in 
support of the alternative hypothesis of difference (though the direction of the 
difference was not specified ahead of time). 

If, however, we have good reason to believe that we will find a difference (for 
example, previous studies or research findings suggest this is so), then we will use 
a one-tailed hypothesis. One-tailed tests specify the direction of the predicted 
difference. We use previous findings to tell us which direction to select. 

In a positive directional hypothesis we expect our group to perform better than 
normal for the population. For a positive directional (one-tailed) hypothesis we 
can reject the null hypothesis at the .05 level if the scores fall in the shaded area: 



[Figure: normal distribution with the right tail shaded; z axis from -3 to +3] 


With a negative directional (one-tailed) hypothesis, we expect our group to 
perform worse than the population. For a negative directional hypothesis we can 
reject the null hypothesis at the .05 level of significance if the scores fall in the 
shaded area: 

[Figure: normal distribution with the left tail shaded; z axis from -3 to +3] 

Let's see what this means, now, in reading the z score table in appendix C. In 
discussing the placement of individual scores in the distribution, we noted that a 
z score of 1.64 leaves an area of .05 beyond it under the tail to the right of the 
distribution. A -1.64 z score leaves an area of .05 beyond it under the tail to the 
left of the distribution. These are the values needed for testing a one-tailed 
directional hypothesis. 

However, given the paucity of replication studies in our field, we seldom use a 
one-tailed directional test. Rather, we state a nondirectional null hypothesis. 
We can reject the null hypothesis if the z value falls under either the right or the 
left tail. If we set an .05 level for rejecting the null hypothesis, that .05 level must 
be divided between the right and left tail. 

Turn to the z score table and scan the z score values until you find the place 
where the area beyond z is .025 (half of the .05 level). That value is 1.96, the z 
value needed to reject a null hypothesis at the .05 level. Now find the z score 
where the area beyond z is .005 (half of the .01 level). That value is 2.57, the z 
value needed to reject a null hypothesis at the .01 level of probability. 

You might want to mark these values directly on the table. With practice, you 
will soon remember the values and not need the table at all. The values to 
remember are: 

z Values Needed in Hypothesis Testing 

                                .05     .01 
1-tailed (directional)          1.64    2.33 
2-tailed (nondirectional)       1.96    2.57 

As you can see, it is easier to reject the null hypothesis with a one-tailed 
directional test than with a two-tailed nondirectional test, because the z value 
needed is smaller. When previous research has already 
shown that a null hypothesis can be rejected, there is less chance of making an 
error (either a Type 1 error or a Type 2 error) in a replication study. Thus, it 
makes sense to use a one-tailed directional hypothesis. 
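
The four z values to remember can also be recovered by working backward from the area under the tail. The following sketch is an illustration under our own naming: critical_z is a hypothetical helper, and inv_cdf is the inverse lookup provided by NormalDist in Python's standard statistics module. 

```python
from statistics import NormalDist

def critical_z(alpha, tails):
    """z value that cuts off alpha in one tail, or alpha / 2 in each of two tails."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A two-tailed test divides alpha between the right and left tails.
    return NormalDist().inv_cdf(1 - alpha / tails)

for tails in (1, 2):
    for alpha in (0.05, 0.01):
        print(f"{tails}-tailed, alpha = {alpha}: z = {critical_z(alpha, tails):.3f}")
```

The computed values carry more decimal places than a printed table, so they may differ from the table entries in the last digit, but they should match 1.64, 2.33, 1.96, and 2.57 to within rounding. 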

We have said that the researcher decides on the level of significance (the 
probability level) at which he or she will feel confident in rejecting the null hypothesis. 
This p level, sometimes called the alpha level and noted as α, tells us how likely 
we are to be right or wrong in rejecting the null hypothesis. We also said that 
convention suggests this level be set at .05, where there are 5 chances in 100 of 
being wrong and 95 chances in 100 of being right. However, this is convention, 
not a hard-and-fast rule. It's true we want the probability of being wrong to be 
very low, but we don't want to overlook promising trends. For example, if you 
evaluated the effectiveness of a new computer-assisted program designed to teach 
technical English vocabulary and the probability of your outcome was .10, the 
outcome is worthy of notice. There are 10 chances in 100 that you would be 
wrong in rejecting the null hypothesis. That may be too many chances of error 
to give us great confidence. You would accept the null hypothesis in such a case, 
but it is still legitimate to report a trend in favor of the program. Otherwise, you 
and your audience might decide that the program was "worthless." Recognition 
of trends does not mean that one probability level is "more" or "less" significant 
than another. We set .05 as our cutoff point and on that basis we either can 
reject or cannot reject the H0. We can talk about "trends" when p narrowly misses 
the established α (usually .05) cutoff point. Having set the cutoff point, we 
determine whether we can or cannot reject the H0. We can discuss trends; however, 
we never look at values that do not make the cutoff point (say, a .12 and a .34) 
and say that one is more or less probable than the other. They are all simply 
typical values in the normal distribution. 

In hypothesis testing the basic question is: How sure are we that we are right? 
There is, however, a second question: What do we lose if we are wrong? 
Statistical tests give us the answer to the first question: how likely we are to be right. 
This is a test of statistical significance. When we consider the implications of the 
results in practical terms, we apply common sense and a consideration of 
alternatives to answer the second question. This is a matter of practical significance. 
Significance is, therefore, a technical term and a practical term. Both questions 
are important, for both need to be answered when we make decisions based on 
research. In some fields, lives may depend on being right, so stringent levels must 
be set. In our field, we can usually afford to be more lenient. If the outcome of 
the research evaluating a computer-assisted program for teaching vocabulary 
showed a .10 or .15 level of significance, and you had set .05 as the level for 
rejecting the null hypothesis, you do not have statistically significant findings. 
However, you could still argue for continuing the program until you found 
something better based on the trend and on practical significance. That is, 
decision making is based on many factors where both questions (how sure are we of 
being right and what do we lose if we are wrong) are considered. 

Decision making should not, of course, impinge on the research process. The two 
are separate. Researchers, however, usually undertake their work with the hope 
of influencing decision making, making recommendations. We want to feel 
confident that the recommendations we make are justified. 


ooooooooooooooooooooooooooooooooooooo 

Practice 8.3 

Think about decision making in the following examples. What weight do the 
statistical levels of significance have on practical decision making in each? 

1. Your private language school is just three years old and is beginning to show 
a profit. A new set of very expensive and very innovative multimedia and 
computer-assisted teaching materials is now on the market. They report 
incredible student gains as a result of using this program (an .01 level of significance). 
Your students are also very impressed by technological toys. 

What do you stand to lose if you do or do not purchase them? 


2. The school district wants to apply for a state grant to revise and evaluate a 
special teacher-designed program in intensive oral language for ESL children. 
The teachers who developed the program want to add materials to transfer oral 
language skills to reading. In the grant request, they note that the oral language 
scores of children showed gains on certain grammar structures (.01 level for 5 
structures, .05 level for 2, and .10 for 4). Without the special program to teach 
transfer, reading scores improved but not significantly. 

The state has few grants to distribute. Does this look like a promising candidate 
for funding? Why (not)? 


3. As a member of a research and development team, you have found that seven 
clause types are very difficult for young ESL learners. The statistical test you 
performed showed these seven to contrast in difficulty with twenty other clause 
types at the .001 level of significance. In your research report to the development 
team you suggest that such clauses not be used in the initial reading materials 
developed for young ESL children. The development team also receives a report 
from the art department which does the visuals and layout of the reading books. 
If these clauses are deleted and others substituted, it may mean that new 
illustrations and new layouts will have to be done. As a practical matter, what 
decisions would you make regarding revisions? Your justification? 


ooooooooooooooooooooooooooooooooooooo 

Steps in Hypothesis Testing 


Let's summarize our discussion regarding probability and hypothesis testing with 
a list of steps: 


1. State the null hypothesis. 

2. Decide whether to test it as a one- or two-tailed hypothesis. If there is no 
research evidence on the issue, select a two-tailed hypothesis. This will allow 
you to reject the null hypothesis in favor of an alternative hypothesis. If there 
is research evidence on the issue, select a one-tailed hypothesis. This will 
allow you to reject the null hypothesis in favor of a directional hypothesis. 

3. Set the probability level (α level). Justify your choice. 

4. Select the appropriate statistical test(s) for the data. 

5. Collect the data and apply the statistical test(s). 

6. Report the test results and interpret them correctly. 


We have discussed the first three steps on this list in some detail. We are ready 
now to consider just how we decide on a statistical test for research data. 
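
To see how the first three steps shape the final decision, consider the sketch below. It is a simplified illustration, not a full statistical test: we assume that steps 4 and 5 have already produced a z value locating the sample result in the normal distribution, and the z of 1.80 is invented for the example. The function name z_test is our own. 

```python
from statistics import NormalDist

def z_test(z, alpha=0.05, tails=2):
    """Judge an already computed z value against choices made in advance.

    tails and alpha correspond to steps 2 and 3; the returned pair is
    (probability of the result, whether the null hypothesis is rejected).
    """
    std_normal = NormalDist()
    if tails == 2:
        # Nondirectional: alpha is divided between the two tails.
        p = 2 * (1 - std_normal.cdf(abs(z)))
    elif tails == 1:
        # Positive directional: only the right tail counts.
        p = 1 - std_normal.cdf(z)
    else:
        raise ValueError("tails must be 1 or 2")
    return p, p < alpha

# Hypothetical result: the sample falls at z = 1.80.
for tails in (2, 1):
    p, reject = z_test(1.80, alpha=0.05, tails=tails)
    verdict = "reject H0" if reject else "cannot reject H0"
    print(f"{tails}-tailed: p = {p:.4f} -> {verdict}")
```

With the same z value and the same .05 level, the two-tailed choice does not allow rejection while the one-tailed choice does. This is why the decision about tails must be made, and justified, before the data are examined. 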


Distribution and Choice of Statistical Procedures 


We have said that when we collect sufficient data, the distribution of those data 
will approach the normal distribution. We often assume that if we carefully 
identify the population from which we want to draw our subjects (say, the 
population of language learners in general) and if we randomly select subjects from 
that population, the data from this randomly selected sample of Ss will be 
representative of that population. The scores from this sample will form a normal 
distribution which will fit into the normal distribution of the population from 
which the sample was drawn. Thus, we will be able to draw inferences about the 
population on the basis of our sample data. 

When we want to generalize from our sample to the population it was drawn 
from, we must be certain that the sample is truly representative. We have talked 
about this issue in previous chapters. Random selection is an important issue 
here. However, another issue is how large a sample we need to be sure the 
distribution of the data will be normal and representative as well. One basic rule 
of thumb is that the sample must include 30 or more cases, but this is not a set 
rule. 


Imagine that your research involved the ability of eighth-grade students to find 
errors in a written text. However, you wonder whether boys and girls might, in 
fact, perform differently. If you had only 30 cases (15 boys and 15 girls), you 
would not have met even this minimum requirement. You would need 60 Ss, 30 
boys and 30 girls, to meet the minimum requirement. 

The nature of the population you have identified for your study will also 
determine the number of cases needed. For example, if the population of concern is 
eighth-grade Korean girls who arrived in San Francisco and entered school this 
year, the population is limited. A sample of 30 cases might be drawn that could 
reasonably be expected to represent this population. If the population of concern 
is Korean students studying English in eighth-grade classes in the United States, 
a sample of 30 cases drawn at random would probably not match the population. 
If you wanted to generalize your findings to all Korean students studying English 
(rather than just eighth-grade students), the population becomes even larger. No 
matter how carefully you stratify this random sample, it is unlikely that you will 
get a true match on the distribution of the data drawn from this sample and that 
drawn from the total population of Korean students studying English. 

Representativeness of the cases in any sample is important. A second factor of 
importance is the distribution of the data obtained from the sample. We have 
said that if the sample is large enough, the distribution of the data will approach 
a normal distribution. Again, 30 is the magic number at which we assume the 
distribution to be close enough to normal. However, it is important to check to 
be sure this is the case (not just to assume it is okay). Since we know the distri¬ 
bution will never be exactly that of a normal distribution, the question is: How 
close must it be? 

Think for a moment of the diagrams we drew of skewed distributions. We
suggested that skewed distributions occur when some scores (or observations) are so
extreme that the mean is not an accurate measure of central tendency. This
occurs because some of the Ss or observations are not typical of the group.
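
To see how a few atypical scores pull the mean away from the center of the group, here is a minimal Python sketch. The twelve scores are invented for illustration, and the cutoff of 50 used to set aside the two extreme scores is an arbitrary assumption, not a rule.

```python
from statistics import mean, median

# Hypothetical scores for one class: most fall in the 60s and 70s,
# but two Ss score far below everyone else.
scores = [62, 64, 65, 67, 68, 70, 71, 72, 74, 75, 12, 15]

print("mean:  ", round(mean(scores), 2))
print("median:", median(scores))

# Setting the two extreme scores aside shows how strongly they pulled the mean.
typical = [s for s in scores if s > 50]
print("mean without extremes:  ", round(mean(typical), 2))
print("median without extremes:", median(typical))
```

With these invented scores the mean is 59.58, lower than ten of the twelve scores, while the median is 67.5. Without the two extreme scores the mean (68.8) and the median (69.0) nearly agree.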

You can see that if care is not taken in selecting the sample, we cannot expect a 
normal distribution, and that if we do not obtain a normal distribution, we must 
conclude that the sample includes people who are not representative of the
population we wanted to study.

Sometimes we cannot randomly select a sample but must work with an intact 
group, our very own class of students. We want to describe their behavior, not 
generalize to all such classes. Their performance (assuming it's a class of 30) may 
approach a normal distribution. However, the class may have a few individuals 
in it who give extreme data (either outperform or underperform all others). The 
distribution may not be normal, so the median, rather than the mean, should be 
used as the measure of central tendency. What are we to do? This question is
answered by deciding just how to analyze and interpret the data.

Practice 8.4

1. If you could draw only 30 Ss to represent the population of eighth-grade
Korean ESL students in the United States, how would you stratify the sample to
make it as representative as possible?_


2. If you wanted to generalize to all ESL eighth-grade students, would a sample
of 30 Ss suffice to represent this population? Why (not)?_


3. If we have already randomly selected Ss and some of these skew our data, this
is a warning that we must be cautious about generalizing. If we look at the
special characteristics of those Ss who have skewed the distribution, we may find
they have characteristics that suggest another subgroup within the population.
If so, should we find a random sample of 30 such Ss, add these to our study, and
test for differences among the subgroups? Or should we add more Ss to our study
to get a more representative sample so that we obtain a more normal distribution?

Discuss this issue in your study group. If you had unlimited resources, how 
would you resolve these research problems?_ 


ooooooooooooooooooooooooooooooooooooo 

Choices in Data Analysis 

When we gather data and obtain results, we test the results to determine whether 
they support our hypotheses or not. If we claim that they do, we want to be 
certain we are correct. The use of statistical procedures, via probability, tells us
whether we can have confidence in our claims.

The choice of an appropriate statistical procedure that establishes this confidence 
is extremely important. You can know all about various statistical procedures 
(just as you can know all about addition, multiplication, subtraction, and
division), but if you don't know which procedure to use, this knowledge won't help
much.

To select an appropriate statistical test for your data (the procedure that will tell
you how much confidence you can have in the claims you make) means making
decisions about a number of factors. We have mentioned many of these in earlier
chapters.

The first deciding factor has to do with identifying the number of independent 
and dependent variables and the number of levels of each variable in your study. 
The number of variables and the number of levels of each variable determine, in 
part, the type of statistical procedure you will use. 

A second factor that influences the choice of a statistical procedure is the type of
comparisons to be made in the study. Are you comparing the performance of
individuals with themselves on different tasks or on the same task at different
times? Are your comparisons among different people all doing the same task(s)?
Are your comparisons among groups of people where each group receives a
slightly different treatment?

The third has to do with how you measured your variable(s). Are the data
frequency counts of nominal data (counts that say "how many" or "how often"
rather than "how much")? Are they scaled, ordinal data where responses are
ranked with respect to each other along a continuum? Or have you measured
equal-interval data? Your answer to this question also determines, in part, the
kind of statistical test you will use.

The fourth decision relates to continuous data. Do the data form a normal 
distribution? What is the better measure of central tendency, the mean or the
median?

The fifth factor has to do with the shape of the distribution in the population
from which the samples have been drawn and, finally, whether you hope to
generalize to the population. In this case, does the sample adequately represent the
population to which you hope to generalize?

All five of these factors are important. The last three issues, however, have much 
to do with whether the statistical procedure will be parametric or nonparametric.

Parametric vs. Nonparametric Procedures 

A nonparametric procedure is one which does not make strong assumptions
about the shape of the distribution of the data. Parametric procedures, on the
other hand, make strong assumptions about the distribution of the data. First,
strictly speaking, parametric tests assume that dependent variables are interval
scored or strongly continuous. That is, a parametric test assumes that the data
are not frequencies or ordinal scales but interval data where the X and s.d. are
appropriate measures of central tendency and dispersion. Nonparametric tests
work with frequencies and rank-ordered scales. They assume that the variable
being studied has some underlying continuity and normality of distribution, but
the assumption is weak compared with that for parametric tests.

While most nonparametric tests apply to data that are nominal or rank-ordered,
they can be used with interval data as well. There are times when you may not
feel all that confident about the "equal-intervalness" of the measurement and the
use of X and s.d. to represent the distribution of data. You may be fairly sure
that someone who scored a 60 is really better than someone who scored 30, but
you aren't too sure about the difference between a 50 and a 60. In that case, you
do think there is some underlying linearity to the data but you don't feel that the
points on the line are really equally spaced. If a statistical test requires interval
data where X and s.d. can be used, that is what is required. "Sort of" is not
enough. If you're not happy with the intervalness of your measurement, use a
nonparametric test.
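
Many nonparametric procedures work from the rank order of the scores rather than from the scores themselves. As a rough sketch of what that means, the Python function below (a hypothetical helper, not part of any statistical package) converts scores to ranks, giving tied scores the average of the ranks they share. The five scores are invented.

```python
def average_ranks(scores):
    """Rank scores from lowest (rank 1) upward; tied scores share the mean of their ranks."""
    ordered = sorted(scores)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1          # 1-based position of the first occurrence
        last = first + ordered.count(value) - 1   # 1-based position of the last occurrence
        rank_of[value] = (first + last) / 2
    return [rank_of[s] for s in scores]


scores = [30, 50, 60, 60, 85]   # hypothetical test scores
print(average_ranks(scores))    # [1.0, 2.0, 3.5, 3.5, 5.0]
```

Once the scores are ranked, the size of the gap between 30 and 50 or between 60 and 85 no longer matters. Only the order is kept, which is the weaker claim we are willing to make when we doubt the intervalness of the scale.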

Second (and following from the first), parametric tests assume that the data are
normally distributed. You remember that the larger the sample size, the more
likely you are to obtain a normal distribution. For many studies in our field, we
have only a small number of Ss or a small number of observations. We cannot
always be confident that these will be normally distributed so that the X and s.d.
are appropriate statistics to describe the distribution. If interpretation of the test
requires that the data be normally distributed and yours are not, you should turn
to a nonparametric procedure. (If you have an excellent statistical package, your
computer may be able to tell you if the distribution is or is not normal and just
how far afield you are when you use a parametric procedure with such data.)
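
Even without a statistical package, a quick screening of the shape of a distribution is possible. The sketch below computes a moment-based skewness coefficient in Python. It is offered only as a rough check, not as a formal test of normality, and both sets of scores are invented. A value near zero suggests a symmetric distribution; a clearly negative or positive value suggests a tail of unusually low or high scores.

```python
from statistics import mean, pstdev


def skewness(scores):
    """Moment-based skewness: the average cubed z-score (0 for a perfectly symmetric set)."""
    m, sd = mean(scores), pstdev(scores)
    return sum(((x - m) / sd) ** 3 for x in scores) / len(scores)


symmetric = [50, 55, 60, 65, 70, 75, 80]           # evenly spread around 65
with_low_outliers = [10, 15, 60, 65, 70, 75, 80]   # two very low scores

print(round(skewness(symmetric), 2))
print(round(skewness(with_low_outliers), 2))
```

The first set yields a coefficient of zero and the second a negative one (a tail toward the low scores). How large a value is too large is a judgment call that this sketch does not settle.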

A third assumption of parametric tests is that we can estimate the distribution in 
the population from which the respective samples have been drawn. In 
parametric tests, the distribution of the data in samples is used to estimate the 
distribution in the population represented by the samples. This is possible only 
because so much is known about the nature of human performance and the 
normal distribution of that performance in very large groups (i.e., the
population). With small size samples and sample distributions that do not appear
normal, you might feel some hesitation in applying such tests. Again, 
nonparametric procedures may be more appropriate since they do not rest on 
such estimates. 

A fourth assumption underlying most parametric tests is that the observations
are independent. The score assigned to one case must not bias the score given to
any other case. So, for example, if you score compositions, the score you give the
first composition should not influence the value you give to the second
composition. You know how difficult this is if you have ever taught composition. There
is another type of independence required by many statistical procedures. This is
that the data in one cell of the design should not influence the data in another
cell. In repeated-measures designs, the data are not independent. That is, if we
collect oral data and written data from the same Ss, the data are not
independent. Some of the variability (or lack of variability) in the two things measured
may have nothing to do with the variables but rather with the fact that the same
person produced the data. Another example of nonindependence of data is
frequency counts of particular grammatical features. When straight frequencies are
used (rather than a rate of n per some number of words), high frequencies from
one piece of text may contribute a great deal to the frequency of the structure
overall. To be independent, each piece of text would have to be tallied as showing
presence/absence of the structure (rather than contributing to some overall total
of the structure for that text type). Independence of data is discussed in different
ways depending on the statistical procedure. We will, of course, give you a
warning about this for each procedure. In some cases, it will mean selecting a
nonparametric procedure. In other cases, it will mean selecting a
repeated-measures parametric procedure.
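
The point about straight frequencies can be illustrated with a small Python sketch. The three texts, their lengths, and the counts are all invented. The sketch compares each text's share of the raw overall total with its rate per 100 words.

```python
# Hypothetical counts of one grammatical structure in three texts of different length.
texts = [
    {"name": "text A", "words": 200, "count": 4},
    {"name": "text B", "words": 1500, "count": 30},
    {"name": "text C", "words": 300, "count": 6},
]

total = sum(t["count"] for t in texts)
shares = [t["count"] / total for t in texts]             # share of the raw overall total
rates = [100 * t["count"] / t["words"] for t in texts]   # occurrences per 100 words

for t, share, rate in zip(texts, shares, rates):
    print(f'{t["name"]}: raw count {t["count"]}, {share:.0%} of total, {rate:.1f} per 100 words')
```

All three invented texts use the structure at the same rate (2.0 per 100 words), yet the longest text alone supplies 75% of the raw total. A raw total would therefore tell us mostly about one piece of text.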

If, as we mentioned, nonparametric tests can be selected for nominal, ordinal, and 
even interval data, why would we not just opt for them all the time? The reason
is that we want the most powerful test that we can find to test our hypotheses.

Power is a technical term in statistics, concerning the probability of rejecting a
false null hypothesis. In selecting a statistical test we want one that gives us very little
chance of making a claim when we shouldn't have done so. It is also one which
gives us little chance of not making a claim when we should have. That is, the
most powerful test allows us to be sure that when we reject the null hypothesis
we are correct or that when we accept the null hypothesis we are correct. The
most powerful test (and in a sense, the best test) is the one where there is the least
chance of making an error in rejecting or accepting the null hypothesis.
Parametric tests utilize the most information (i.e., the X and s.d. use all the
information in the data) and so they are more powerful when assumptions
underlying the procedures are met. Also, tests which require normal distribution are
more powerful because characteristics of the normal distribution are well-known.
Parametric tests, because of this power, allow us to make our claims with
confidence. If good design has overcome threats to internal and external validity, and
if the requirements of parametric tests have been met, we can generalize with
confidence as well. That is, we can make inferences from our findings to the
population from which the sample data were drawn.
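
The two kinds of chance described above can be estimated by simulation. The Python sketch below is only an illustration under simplifying assumptions that are ours, not the text's: scores are normally distributed, the population s.d. is known (so a simple two-tailed z-test at the .05 level stands in for the tests presented later), and the group size, s.d., and true difference are invented. The function name `rejection_rate` is hypothetical.

```python
import random


def rejection_rate(true_difference, n=30, sd=10.0, trials=2000, seed=1):
    """Share of simulated two-group studies in which a two-tailed z-test (alpha = .05) rejects the null."""
    rng = random.Random(seed)
    standard_error = sd * (2 / n) ** 0.5   # s.e. of the difference between two group means
    rejections = 0
    for _ in range(trials):
        group_a = [rng.gauss(50, sd) for _ in range(n)]
        group_b = [rng.gauss(50 + true_difference, sd) for _ in range(n)]
        z = (sum(group_b) / n - sum(group_a) / n) / standard_error
        if abs(z) > 1.96:
            rejections += 1
    return rejections / trials


# No real difference: every rejection is a claim we should not have made.
print("rejections when the null hypothesis is true: ", rejection_rate(0))
# A real difference of 8 points: every rejection is a correct claim (power).
print("rejections when the null hypothesis is false:", rejection_rate(8))
```

Under these assumptions the first rate should come out close to .05 and the second well above .80. Rerunning with a smaller `n` or a smaller `true_difference` shows how quickly power falls.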

Parametric tests are, as a general rule, more powerful than nonparametric tests
when assumptions underlying the procedures are met. (The power of
nonparametric tests does increase rapidly as you increase the number of Ss.) In
part this is because they are tests which utilize the most information. Also, tests
which require normal distribution are more powerful because characteristics of
the normal distribution are known. Therefore, parametric tests should be our
first choice if the assumptions of the particular procedure are met and if we feel
confident about the measurement of the variables and if we have overcome the
many threats to the validity of the study so that we can generalize our findings.
As you will see, some of the parametric tests are claimed to be "robust" to
violations of their assumptions. Some computer programs will, in fact, show you
how severe a Type 1 or Type 2 error you might commit if you use them in
violation of the assumptions. However, we cannot know what happens if we violate
more than one, or some combination of assumptions that underlie a procedure.
It is important, therefore, that we check all the assumptions of any statistical test
before applying it to data.

Practice 8.5

1. Think back to the example in chapter 6 (page 178) regarding attitudes of
teacher trainees toward microteaching. Would you say that the assumption of
linearity (i.e., equal-interval linearity) for the scale is weak or strong?_

Our university uses a set of 9-point ratings of teacher performance. Would you
judge the linearity of the scale as weak or strong?_. Are the X and s.d.
an appropriate description of the distribution? Why (not)?_


2. Assume that you wanted to give someone a score based on an oral interview.
The measure shows the number of words per minute during a three-minute
segment of the interview. Then, you also want to give the person a general fluency
rating. Can this be done so that the two scores are independent? Why (not)?


3. Think about each of the following examples. In your study group, decide
whether a parametric or a nonparametric procedure seems best for data analysis. 

Example A

As part of an evaluation project, you asked high school students
whether or not they enjoyed studying languages prior to their
enrollment in language classes. At the end of the year, you again
asked them if they liked studying languages. There are 25 students
in all. After you collected their responses, you looked to see how
many people who didn't like languages to begin with changed their
minds and liked studying them at the end; how many who said "no"
still said "no"; how many who said "yes" now said "no"; and
how many said "yes" and still said "yes" after the course.

Measurement (nominal frequencies/rank-order scales/interval scores):_

Type of comparison (between independent groups/repeated-measures/combined 
design): ___ 

Representativeness of sample: _ 

Normal distribution of data expected? _ 

Can we generalize to all high school students?_

Independence of measures?_ 

Choice (parametric/nonparametric) and rationale _ 


Example B

Trainees in your TEFL program have designed a set of ESL lessons
based on authentic language materials (materials prepared for
native speakers of the language). These lessons are for beginning
students. Using "convenience sampling," you find an "intact class" (an
ordinary, everyday class to which Ss were assigned prior to the
study) of beginners. Eleven students in this class of beginners use
the materials for one of the units of the course (a unit on
description). Ten students use the materials normally selected to teach
this unit. A panel of judges who were unaware of the treatment
read and ranked all the Ss on their descriptive compositions. The
best S was rated 1, second best 2, and so forth.

Measurement (nominal frequencies/rank-order scales/interval scores): __ 

Type of comparison (between independent groups/repeated-measures/combined
design): ___ 

Representativeness of sample: 

Normal distribution of data expected? _

Can we generalize to all ESL/EFL students using such materials? _

Independence of measures? _

Choice (parametric/nonparametric) and rationale_


Example C

Hensley & Hatch (1986) evaluated the effectiveness of a language 
lab listening comprehension program for Vietnamese students in a 
refugee camp on Bataan, the Philippines. Approximately half of the 
400 students enrolled in ESL had participated in a listening
comprehension program in the lab while the others had taken regular
ESL classes. The duration of the program was 10 weeks. While the 
lab was a listening program, the hope was that listening skills would 
transfer to better speaking skills. The question was whether the lab 
group performed as well or better than the regular ESL group. 
Students were given scores for such things as pronunciation, syntax, 
confidence, and communicative ability based on their informal 
conversations with a native speaker. 

Measurement (nominal frequencies/rank-order scales/interval scores):__ 

Type of comparison (between independent groups/repeated-measures/combined
design): _

Representativeness of sample: ____

Normal distribution of data expected? _

Can we generalize to all ESL/EFL students using such materials?

Independence of measures?

Choice (parametric/nonparametric) and rationale_


Example D 

Having studied conversational analysis, you would like to apply
what you have learned to ESL teaching. You wonder whether you
can effectively teach conversational openings, closings, and
turn-taking conventions in phone conversations. In the first week of
class you ask students in your class to phone you to talk about their
goals for the class. In reviewing each call, you assign a score for
each person's phone skills (reflecting openings, closings, and
turn-taking) on a 10-point scale. During the term, you teach the unit on
phone conversations that you prepared. At the end of the course,
you ask the students to call once again and again rank them
on the scale. The research question is whether any improvement
occurred. While there is no control group in this study, you want
to share the results of this pilot research with teachers at your next
ESL in-service.

Measurement (nominal frequencies/rank-order scales/interval scores):__

Type of comparison (between independent groups/repeated-measures/combined
design): __

Representativeness of sample: __ 

Normal distribution of data expected? _ 

Can we generalize to all ESL/EFL students using such materials? 

Independence of measures? _ 

Choice (parametric/nonparametric) and rationale_ 


Example E 

In the study mentioned earlier that evaluated the effectiveness of 
the language lab used by Vietnamese refugee students, we wondered 
whether lab students would improve on general oral language skills. 
This time, 30 pairs of students were matched on the basis of age,
sex, and scores on a placement test battery. The oral skills test was
a general oral language test given to all students at the end of the
total 12-week course. Two teachers scored each test and an average
score for each S was computed. Teachers did not know which
students were lab students.

Measurement (nominal frequencies/rank-order scales/interval scores):_ 

Type of comparison (between independent groups/repeated-measures/combined 
design): _

Representativeness of sample: _ 

Normal distribution of data expected? _ 

Can we generalize to all ESL/EFL students using such materials? _

Independence of measures? _ 

Choice (parametric/nonparametric) and rationale_ 


Would your choice change if one person was ill on the final day and so only 29 
matched pairs of students took part in the study? Why (not)?_


ooooooooooooooooooooooooooooooooooooo 

Conclusion for Part II 

In the first section of this volume, we presented some of the basic principles of
research design. Data description has been the focus of part II. In this final
chapter we have introduced the notion of using statistical tests to give us
confidence that our descriptions are correct and/or that we can make inferences from
the data.

We have discussed some of the issues that guide the researcher in determining the 
most appropriate statistical test for the data. The first two have to do with
design, the third with measurement, and the last has to do with selection between
parametric and nonparametric options: 

1. Number of dependent and independent variables (and the number of levels 
within each variable)? 

2. Comparison between groups or repeated-measures?

3. Level of measurement (frequencies, rank-order scales, interval scores)?

4. Assumptions of parametric tests

a. Truly continuous data?

b. Normal distribution?

c. Equal variances? 

d. Independent observations? 

Parametric procedures should always be our choice when (a) all the assumptions
of parametric procedures have been met and (b) we wish to draw inferences
about the population from which our sample data were drawn. This does not
mean that parametric procedures cannot be used for descriptive purposes. They
can (and should) be if (and that is an important "if") the assumptions behind such
procedures have been met. However, when parametric procedures are used for
inferential purposes (when we want to generalize), then not only must the
assumptions of parametric tests be met but the design must also allow for
generalization. That is, threats to internal and external validity of the design must
have been met.
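
The four assumption questions can be written down as a simple checklist. The Python sketch below is a toy reminder only; the function name `suggest_procedure` and its yes/no arguments are hypothetical, and real decisions also depend on the design questions listed above. In particular, data that are not independent may call for a repeated-measures parametric procedure rather than a nonparametric one, which this sketch does not try to capture.

```python
def suggest_procedure(interval_data, normal_distribution,
                      equal_variances, independent_observations):
    """Return a suggested family of tests plus the list of unmet assumptions."""
    answers = {
        "truly continuous (interval) data": interval_data,
        "normal distribution": normal_distribution,
        "equal variances": equal_variances,
        "independent observations": independent_observations,
    }
    unmet = [name for name, met in answers.items() if not met]
    family = "parametric" if not unmet else "nonparametric"
    return family, unmet


print(suggest_procedure(True, True, True, True))
print(suggest_procedure(False, False, True, True))
```

The first call returns "parametric" with an empty list. The second returns "nonparametric" together with the two assumptions that were not met, so the reason for the choice stays visible.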

Part III of this volume will cover statistical tests that allow us to draw
comparisons between and/or among groups. Options will be given to allow you to select
the most appropriate procedure, parametric or nonparametric, for such
comparisons.


Activities 

1. V. Nell (1988. The psychology of reading for pleasure: needs and gratification.
Reading Research Quarterly, 23, 1, 6-50.) studied people who are sometimes called
ludic readers, bookworms who read at least a book a week. Among many
findings, such readers preferred to read materials which they themselves had
judged to be devoid of merit. The null hypothesis for this part of the study might
have been "There will be no difference in preference for books which have or have
not been judged to have literary merit." The null hypothesis was rejected at the
.001 level. How much confidence can you have in this finding (i.e., how many
chances are there that the author may have made a mistake in rejecting the null
hypothesis)?

2. D. Ilyin, S. Spurling, & S. Seymour (1987. Do learner variables affect cloze
correlations? SYSTEM, 15, 2, 149-160.) studied the correlation of cloze test scores
with scores in other skill areas. (Correlation is a statistical measure of
relationship between variables. If you want to know more now, flip ahead to chapter 14.)
By grouping the learners, they were able to compare the correlations obtained for 
different types of learners. One of their findings was that cloze test score and 
listening test score correlations were higher for young learners than for older 
learners. Fisher's z indicated the difference was significant (p < .05). How much 
confidence can you have in this finding (i.e., how many chances are there that the
authors might have been wrong in rejecting the null hypothesis of no difference
in correlations for older and younger learners)?

3. J. M. Fayer & E. Krasinski (1987. Native and nonnative judgments of
intelligibility and irritation. Language Learning, 37, 3, 313-326.) asked native and
nonnative (Spanish) speakers of English to judge the speech of (Spanish) ESL
students. In addition to these judgments, they asked whether errors distracted
them or annoyed them. The study shows that nonnative judges reported more
annoyance with ESL errors than native speaker judges did (p < .05). The null
hypothesis could not be rejected for "distraction." That is, there was no
significant difference in number of reports of distraction in response to the tapes on the
part of native and nonnative speaker judges. State the null hypothesis for
"annoyance." How much confidence can you place in the finding that native
speaker judges reported less annoyance than nonnative judges?

4. M. Eisenstein, S. Shuller, & J. Bodman (1987. Learning English with an
invisible teacher: an experimental video approach. SYSTEM, 15, 2, 209-216.)
evaluated an experimental treatment described in the article. They also looked
at learner variables to see if they might have affected the outcome on the final test
score (Ilyin Oral Interview, 1972). Look at each of the following questions and
convert them to null hypotheses. Then interpret the probability figures. Which
would allow you to reject the null hypothesis? How many chances of being wrong
would you have if you rejected the null hypothesis in each case?

Do females and males score differently? p < .646 
Do Ss from different socioeconomic groups score differently? p < .05
Do Ss who plan to stay in the United States score differently than those
who do not? p < .258

Some Ss prefer that the teacher correct them while other Ss prefer teacher
and peer correction. Do these two groups score differently on the test? 
p < .284 


References

Hensley, A. & Hatch, E. 1986. Innovation in language labs: an evaluation report.
Report for Philippine Refugee Processing Center, ICMC, Bataan, Philippines.

Ilyin, D. 1972. The Ilyin Oral Interview. New York, NY: Newbury House.


Part III. Comparing Groups




Chapter 9 

Comparing Two Groups: 
Between-Groups Designs 

•Parametric comparison of two groups: t-test
Sampling distribution of Xs
Case 1 comparisons
Case 2 comparisons
Sampling distribution of differences between Xs
Assumptions underlying t-tests
Strength of association: eta²
•Nonparametric comparisons of two groups
Selecting the appropriate nonparametric test
Median test
Rank sums tests (Wilcoxon, Mann Whitney U)
Strength of association: eta²
•Deciding among procedures

In the previous section we discussed the major questions that lead to the selection 
of an appropriate statistical test. The first question has to do with the number 
of variables and their function in the research. A second question has to do with 
the type of measurement used. A third question is whether the data come from 
two different groups (a between-groups design) or are two or more measures 
taken from the same group (a repeated-measures design). In this chapter we will
discuss tests involving one independent variable with two levels and one
dependent variable. That is, we will compare the performance of two groups on some
dependent variable. The measurement of the dependent variable will be
continuous (i.e., interval scores or ordinal scales). The comparison will be of two
different groups (a comparison of independent groups, a between-groups design).
Repeated-measures designs are discussed in chapter 10. 

There are several options available to us for comparing two groups. The choice 
has to do with the type of measurement and the best estimate of central tendency 
for the data. We will begin with the t-test, a procedure that tests the difference
between two groups for normally distributed interval data (where X and s.d. are
appropriate measures of central tendency and variability of the scores). Then
we will turn to procedures used when the median is the best measure of central
tendency or where certain assumptions of the t-test cannot be met.

Parametric Comparison of Two Groups: t-test


We have discussed at length how individual scores fall into the normal
distribution. In research, we are seldom interested in the score of an individual
student; rather, we are interested in the performance of a group. When we have
collected data from a group, found the X and s.d. (and determined that these are
accurate descriptions of the data), we still want to know whether that X is
exceptional in any way. To make this judgment, we need to compare the mean with
that of some other group.


In Case 1 studies, we compare the group mean with that of the population from
which it was drawn. We want to know whether the group X is different from
that of the population at large. In Case 2 studies we have means from two groups
(perhaps an experimental group and a control group). We want to know whether
the means of these two groups truly differ.


Let's imagine that you have been able to obtain a special computer-assisted
videodisc program for teaching German. It's an expensive program, and your
department chair asks for evidence that the program has been effective. You do
have the results of a comprehensive vocabulary and structure test given at the
end of the term. You can show the chair the X and s for the class on this test.
You can say that the X of 80 looks much higher than the published mean of 65
for the test. You know he will just say, "Fifteen points? How much better is
that?" Since there are other sections of German I, any of which could serve as a
control group, you immediately ask a fellow teacher if he has the scores for his
class. Unfortunately, he hasn't had time to score them and the chair needs the
information by this afternoon. What can you do? Well, it is possible that you
might do a Case 1 t-test. Some background is needed for us to understand how
this works.


Sampling Distribution of Means 


Assume there are 36 Ss represented in the X for your class. We need to compare
this mean with that of many, many Xs from other groups of 36 Ss. Imagine that
we could draw samples of 36 Ss from many, many other German I classes and
that we gave them the test. Once we got all these sample Xs, we could turn them
into a distribution.


The normal distribution is made up of individual scores. This distribution would
differ in that, instead of being made up of individual scores, it is composed of
class means. As we gathered more and more data, we would expect a curve to
evolve that would be symmetric. This symmetric curve is, however, not called a
normal distribution but rather a sampling distribution of means.


If we were able to build this sampling distribution of means, we could then see
where the X for our German class fit in that distribution. We hope it would place
at the right of the distribution, showing that our class scored higher than other
classes. If the sample X fell right in the center of the sampling distribution of
means, we would know that it is typical of all such classes, no better and no
worse. If the sample X fell at the far left tail of the distribution, we would
interpret this finding to mean that our class scored much lower than other classes.

If you had the Xs for the test from many, many schools and you plotted out the
sampling distribution of means, you might be surprised at the results, especially
if you compared the spread of scores within the sample from your class and the
spread of Xs in the distribution of means. If you actually plotted the means from
all the groups, you would immediately notice how much more similar they are to
each other (much closer to each other) than the individual scores in your class are
to the group mean.

Stop to think about it for a moment, though, and it will make sense. When we
compute a mean (X), the individual differences are averaged out. The high and
low scores disappear and we are left with a central measure. The X is the most
central score for the group. So, since our new distribution is made of Xs, it will
naturally be much more compact than scores in a single distribution. Therefore,
the standard deviation will be smaller in the distribution of Xs.

The size of the groups will also influence the sampling distribution of means. The 
larger the N for each group, the more the X̄s will resemble each other. This is 
because the more scores there are in each group, the greater the chance of a 
normal distribution. And, with a normal distribution, the X̄ becomes quite 
precise as a measure of central tendency. The X̄s of large classes, therefore, should 
be very similar to each other (if they are all from the same population). 

The following figures may clarify this difference in distributions. The first 
distribution is typical of a normal distribution of individual scores in a class of 
30 Ss. The second is typical of a sampling distribution of means for groups of 
30 Ss each. The final figure is typical of a sampling distribution of means for 
groups of 100 Ss each. 



[Figures: class of 30 | groups of 30 | groups of 100] 


When we take the average of a group of scores, we call that central balance point 
the mean and we use the symbol X̄ for the mean. When we take the average of 
a group of means, we call that central balance point the population mean (not the 
"mean of means"!). And the symbol for the population mean is μ (the Greek 
letter mu). 

The reason the mean for the groups is called the population mean is that it is 
drawn from a large enough number of sample groups (selected at random from 
the population) that it forms a normal distribution. Its central point should be 
equal to that of the population. 

The sampling distribution of means has three basic characteristics. 

1. For 30 or more samples (with 30 or more Ss per sample), it is normally 
distributed. 

2. Its mean is equal to the mean of the population. 

3. Its standard deviation, called standard error of means, is equal to the 
standard deviation of the population divided by the square root of the sample 
size. 

The third characteristic may not be crystal clear. Why isn't the standard error 
of the means equal to the standard deviation of the population? Think back a 
moment to the two figures you just chose to represent the distribution of the 
means based on groups of 30 Ss each and that of groups of 100 Ss each. The 
sampling distribution of means depends on the size of the sample groups. The 
two figures (for groups of 30 vs. groups of 100) differed in the amount of spread 
from the central point, with the distribution representing groups of 100 being 
much more compact. Therefore, to make the standard error of means sensitive 
to the N size of the samples from which the means were drawn, we divide the 
standard deviation by the square root of the sample size. 
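
The three characteristics can be checked informally with a small simulation. The 
sketch below is only an illustration: the population mean of 65 is the published 
mean mentioned in the text, but the population standard deviation of 30, the 
number of simulated groups (2,000), and the function name are all assumptions 
made for the example. 

```python
import random
import statistics


def simulate_standard_error(pop_mean, pop_sd, n, num_samples, seed=0):
    """Draw num_samples groups of n scores from a normal population and
    return the mean and standard deviation of the group means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        group = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        means.append(statistics.mean(group))
    return statistics.mean(means), statistics.stdev(means)


# Hypothetical population: mean 65, standard deviation 30.
for n in (30, 100):
    mean_of_means, sd_of_means = simulate_standard_error(65, 30, n, 2000)
    predicted = 30 / n ** 0.5  # sigma divided by the square root of n
    print(f"n={n}: mean of means={mean_of_means:.2f}, "
          f"SD of means={sd_of_means:.2f}, predicted={predicted:.2f}")
```

The mean of the simulated means should sit close to the population mean, and 
their standard deviation should sit close to the predicted value, shrinking as 
the group size grows from 30 to 100. 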

Now, the hard part. While you already have the published population mean for 
the German test from the publishers, you do not know the standard error of 
means for the test. You don't have time to find 30 other classes with 36 Ss each. 
There is a much easier way. We will estimate it, and that estimate will be quite 
precise. 

When we carry out research, we gather data on a sample. The information that 
we present is called a statistic. We use the sample statistic to estimate the same 
information in the population. While a statistic is used to talk about the sample, 
you may see the term parameter used to talk about the population. (Again, 
parameter is a lexical shift when applied to linguistics. The two have little to do 
with each other.) 

Perhaps the following diagram will make the picture clear: 

    Sample                         Population 

      ↓                                ↑ 

    Statistic  →  Estimates  →  Parameter 

Let's see how this works for estimating the standard error of means, the 
parameter $\sigma_{\bar{X}}$ (the Greek symbol is a small sigma): 

$$ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}} $$

Since we will use our sample data to estimate the parameter, we use our sample 
statistics for the formula: 

$$ s_{\bar{X}} = \frac{s}{\sqrt{N}} $$


The formula says that we can find the standard deviation of the means of all the 
groups (called now the standard error of means) by dividing the standard 
deviation of our sample group by the square root of the sample size. Sample statistics 
are the best available information for estimating the population parameters. The 
correspondence between population parameter symbols and sample statistic 
symbols is: 

                               Population parameter    Sample statistic 
    mean                       μ                       X̄ 
    standard deviation         σ                       s 
    standard error of means    σ_X̄                     s_X̄ 


The standard error of means becomes a ruler for measuring the distance of our 
sample mean from the population mean in the same way that standard deviation 
was a ruler for measuring the distance of one score from the mean. The standard 
error ruler will, however, always be very short in comparison to the standard 
deviation ruler. This is because the sampling distribution of means is more tightly 
clustered and thus shows less spread from the X̄. If this does not make sense, 
think back to the three distribution diagrams on page 251 and then discuss the 
matter in your study group. 

Now let's apply all this information to a Case 1 t-test procedure. 


Case 1 Comparisons 

This Case 1 study compares a sample mean with an established population mean. 
The H₀ for a Case 1 t-test would be: There is no effect of group on the dependent 
variable. That is, the test scores result in no difference between the sample mean 
and the mean of the population. 




To discover whether the null hypothesis is, in fact, true, we follow a familiar 
procedure. First we ask how far our sample X̄ is from μ. Then we check to see 
how many "ruler lengths" that difference is from the mean. The ruler, this time, 
is the standard deviation of the means rather than the standard deviation, right? 
The formula for our observed t value is: 

$$ t_{obs} = \frac{\bar{X} - \mu}{s_{\bar{X}}} $$


If the X̄ for our class was 80 and the published μ for the test was 65, you can fill 
in the values for the top part of the formula: 

$$ t_{obs} = \frac{80 - 65}{s_{\bar{X}}} $$

To find the value for $s_{\bar{X}}$, refer back to our formula for the standard error of 
means. The standard deviation for the scores of your German students was 30. 
The class size was 36. So, 

$$ s_{\bar{X}} = \frac{s}{\sqrt{N}} = \frac{30}{\sqrt{36}} = 5 $$

$$ t_{obs} = \frac{\bar{X} - \mu}{s_{\bar{X}}} = \frac{80 - 65}{30 \div \sqrt{36}} = \frac{15}{5} = 3.0 $$
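
The same arithmetic can be sketched in a few lines of Python. The function 
names are made up for this illustration; the numbers (a class mean of 80, a 
published mean of 65, a standard deviation of 30, and 36 students) are the ones 
used in the example. 

```python
import math


def standard_error_of_means(s, n):
    """Estimate the standard error of means from one sample: s / sqrt(N)."""
    return s / math.sqrt(n)


def case1_t(sample_mean, population_mean, s, n):
    """Observed t for a sample mean compared with a published population mean."""
    return (sample_mean - population_mean) / standard_error_of_means(s, n)


se = standard_error_of_means(30, 36)
t_obs = case1_t(80, 65, 30, 36)
print(f"standard error of means = {se}, observed t = {t_obs}")
```

A negative result would simply mean that the sample mean lies below the 
published mean. 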


That's the end of our calculations, but what does this t value tell us (or the chair 
of your German department)? Visualize yourself as one of many, many teachers 
who have given the German vocabulary test to groups of exactly 36 students. 
All of the means from these groups have been gathered and they form a sampling 
distribution of means. The task was to find exactly how well the X̄ of your class 
fits in this distribution. Can you say that they are really spectacularly better, i.e., 
your class X̄ falls so far out under the right tail of the distribution that they 
"don't belong," "are not typical"? Can you reject the null hypothesis? 


Before we can answer, there is one more concept we need to present: the concept 
of degrees of freedom (df). You already know that the t value is influenced by 
the sample size. Sample size relates to degrees of freedom. You'll remember in 
several of our formulas, we averaged not by dividing by N but by N − 1. This, 
too, is related to degrees of freedom. 

Degrees of freedom refers to the number of quantities that can vary if others are 
given. For example, if we know that A + B = C, we know that A and B are free 
to vary. You can put any number in the A and B slots and call the sum C. But 
if you change C to a number so that A + B = 100, then only one of the numbers 
(A or B) can vary. As soon as you plug in one number, the other is set. If A = 
35, then B is not free to vary. It has to be 65. Only one was free, the other is 
fixed. So we say there is one degree of freedom. If we said that A + B + C = 
100, then two of the numbers (for A, B, or C) can vary and the third is fixed. 
Two are free: two degrees of freedom. To find the degrees of freedom we 
subtract 1 from N (N − 1). This concept of degrees of freedom also applies to groups. The 
number of degrees of freedom is important, for it determines the shape of the t 
distribution. Mathematicians have already described these distributions for us 
according to the degrees of freedom. And, fortunately for us, mathematicians 
have also worked out the probabilities for each of these distributions. We are in 
luck. We can now find the df for our study and check the probability to 
determine whether we can or cannot reject the null hypothesis. 

To obtain the df we use N − 1. The N of our sample was 36 so df = 35. To find 
the probability, we will consult the appropriate distribution in table 2, Appendix 
C. Now we can find the answer to our question. 

In the table, notice that the probability levels are given across the top of the table. 
In the first column are the degrees of freedom. If the df were 1, you would look 
at the values in the first row to determine whether you could or could not reject 
the null hypothesis. For example, if you chose an .05 level for rejecting the null 
hypothesis, you would look across the first row to the second column. The 
number 12.706 is the t critical value that you need to meet or exceed in order to 
reject the null hypothesis. If you chose an .01 α (rejection point), you would need 
a t value of 63.657 or better. If your study had 2 degrees of freedom, you would 
need a t value of 4.303 or better to reject the null hypothesis at the .05 level. 

Let's see how this works with our study. We have 35 degrees of freedom. Look 
down the df column for the appropriate number of degrees of freedom. 
Unfortunately, there is no 35 in this particular table. If there is no number for the df 
of your study, move to the next lower value on the chart (as the more conservative 
estimate). The closest df row to 35 df in this table is 30. Assuming you chose an 
.05 α, look at the critical value of t given in the column labeled .05. The 
intersection of df row and p column shows the critical value needed to reject the null 
hypothesis. 

The table, then, shows that you need to obtain or exceed a t value of 2.042 to 
reject the null hypothesis. For an .01 level, you would need to meet or exceed a 
t value of 2.750. These cut-off values are called the critical value of t or t 
critical. When the observed value of t meets or exceeds the critical value for the 
level selected (.05 or .01), you can reject the null hypothesis. (The critical value 
works for either positive or negative t values.) 
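
The look-up rule just described (use the next lower df row when yours is 
missing, then compare the absolute value of t observed with t critical) can be 
sketched as follows. The dictionary holds only a handful of rows of two-tailed 
critical values from a standard t table; the structure and the function names 
are assumptions made for this illustration, not part of the Appendix table. 

```python
# A few rows of two-tailed critical values of t (df -> {alpha: t critical}).
T_CRITICAL = {
    1: {0.05: 12.706, 0.01: 63.657},
    2: {0.05: 4.303, 0.01: 9.925},
    30: {0.05: 2.042, 0.01: 2.750},
    40: {0.05: 2.021, 0.01: 2.704},
}


def critical_t(df, alpha):
    """Return t critical, falling back to the next lower tabled df row."""
    usable_rows = [row for row in T_CRITICAL if row <= df]
    if not usable_rows:
        raise ValueError("df must be at least 1")
    return T_CRITICAL[max(usable_rows)][alpha]


def can_reject(t_obs, df, alpha=0.05):
    """True when the observed t meets or exceeds the critical value.
    The sign of t is ignored, as with a two-tailed hypothesis."""
    return abs(t_obs) >= critical_t(df, alpha)


print(critical_t(35, 0.05))    # the row for 30 df is used
print(can_reject(3.0, 35))     # the German class example
```

Falling back to the lower row gives a slightly larger critical value, which is 
why the text calls it the more conservative estimate. 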

We can reject the null hypothesis in this case because our observed t value of 3.0 
exceeds the critical value (t critical) of 2.042 needed for a probability level of .05. 
In the research report, this information would be given as the simple statement: 
t = 3.0, df = 35, p < .05. We can reject the H₀ and conclude that, indeed, our 
German class excelled! 

The probability levels given in this table are for two-tailed, nondirectional 
hypotheses. If your particular study has led you to state a directional, one-tailed 
hypothesis, you need to "double" these values. That is, a one-tailed hypothesis 
puts the rejection under only one tail (rather than splitting it between the two 
tails). So, to find a one-tailed critical value for an .05 level, use the .10 column. 
To set an .01 rejection level, use the column labeled .02. Thus, for 1 df and a 
one-tailed .05 rejection area, you need a t critical value of 6.314 or better. For 
an .01 level, the one-tailed critical value of t would be 31.821. If your study had 
16 df and you wished to reject a one-tailed hypothesis at the .05 level, you would 
need a t value of 1.746 or greater. If your study had 29 df, a t critical of 2.462 
would be needed for rejection at the .01 level. For the more normal two-tailed 
hypothesis at the .01 level, the values needed would be 2.921 for 16 df, and 2.756 
for 29 df. 
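
The column-doubling rule for one-tailed hypotheses can be sketched in the same 
way. The rows below are standard t-table values for 1, 16, and 29 df with 
two-tailed column headings; the dictionary layout and the function name are 
assumptions made for the illustration. 

```python
# Column headings are two-tailed probability levels (df -> {level: t critical}).
T_TABLE = {
    1: {0.10: 6.314, 0.05: 12.706, 0.02: 31.821, 0.01: 63.657},
    16: {0.10: 1.746, 0.05: 2.120, 0.02: 2.583, 0.01: 2.921},
    29: {0.10: 1.699, 0.05: 2.045, 0.02: 2.462, 0.01: 2.756},
}


def critical_t(df, alpha, tails=2):
    """Look up t critical; a one-tailed test uses the column labeled 2 * alpha."""
    column = alpha if tails == 2 else round(alpha * 2, 2)
    return T_TABLE[df][column]


print(critical_t(16, 0.05, tails=1))  # read from the .10 column
print(critical_t(29, 0.01, tails=1))  # read from the .02 column
print(critical_t(16, 0.01, tails=2))  # read from the .01 column
```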

Now let's review a few of the points discussed in the chapter on design. You will 
remember that there are many pitfalls in planning research. In this case, even if 
you find that the class did remarkably well on a published test, you cannot say 
that this "proves" the materials are effective. When research has been conducted 
with intact classes, and when we have not examined possible threats to internal 
and external validity, the results must be interpreted with care (i.e., 
conservatively). However, the results do allow you to reject the null hypothesis 
with confidence and conclude that the performance of the class is statistically 
better than the published mean. What caused the difference is open to 
interpretation. 


ooooooooooooooooooooooooooooooooooooo 

Practice 9.1 

► 1. Use the t-test table in the Appendix to determine the t critical value needed 
to reject the following null hypotheses: 

_____ 14 df, .05 α, one-tailed hypothesis 

_____ 22 df, .01 α, one-tailed hypothesis 

_____ 55 df, .05 α, one-tailed hypothesis 

_____ 10 df, .05 α, two-tailed hypothesis 

_____ 27 df, .01 α, two-tailed hypothesis 

2. If you were to teach the course again, and if there were three other sections 
of German I, how might you design an evaluation that would be more 
convincing? (You might be able to persuade your chair to contact the program designers 
to see if they would allow you to keep the materials in exchange for a formal 
evaluation!) 

Redesigned evaluation _____ 


► 3. Suppose there are 14 students in your class. The school requires that 
students take a standard reading test at the beginning of the term. The X̄ for your 
class is 80 and the s is 5. The published mean for your class grade on this 
national test is 85. To paraphrase Garrison Keillor, you thought all your children 
were "better than average." Were they? 

Research hypothesis? There is no effect of group on reading 
achievement (i.e., there is no difference in the reading achievement 
scores of the two groups). 
Significance level? .05 
1- or 2-tailed? 2-tailed 
Design 
    Dependent variable(s)? Reading achievement 
    Measurement? Scores (interval) 
    Independent variable(s)? Group 
    Measurement? Nominal (class vs. population) 
    Independent or repeated-measures? Independent 
    Other features? Intact class (no random selection) 
Statistical procedure? Case 1 t-test 

Box diagram: 

CLASS POP 


The formula again is: 

$$ t_{obs} = \frac{\bar{X} - \mu}{s_{\bar{X}}} $$


Enter your calculations below: 


How many df are there (N − 1)? _____. Look at the t-test table in the 
Appendix. Find the df in the first column. Then find the intersection with the 
significance level you chose. What critical t value is listed? _____. (You can 
disregard the negative sign on t observed.) Does your observed t value exceed the 
critical value of t? _____ 

4. Think for a minute about the design of this study. If your class X̄ were 
statistically different from the published μ, could you have attributed this sterling 
performance to your classroom instruction? Why (not)? _____ 


5. Notice that the standard deviation for your group was relatively small. The 
class members all performed in a similar way, so you can't say that the mean was 
biased by extreme scores of some few students. It looks like Garrison Keillor was 
wrong: the group was not better but worse. What other factors might have been 
involved in poor performance on the test? _____ 


6. You may have noticed an interesting phenomenon in checking the t-test table 
in the Appendix. As the sample size becomes smaller, the t critical value for 
rejecting the null hypothesis becomes higher. The smaller the number of Ss, the 
larger the differences must be between the two means. Discuss this phenomenon 
in your study group. Write your consensus as to why this should be so. 


ooooooooooooooooooooooooooooooooooooo 

Case 2 Comparisons 

Case 2 studies are much more common in applied linguistics research than are 
Case 1. We often want to compare the performance of two groups. Perhaps the 
scores are of an experimental group vs. a control group, Spanish vs. Korean ESL 
students, or field dependent vs. field independent learners. 

Let's assume that you have two bilingual classrooms. In one classroom, 
instruction is conducted in Spanish in the morning by one teacher and in English in the 
afternoon by a second teacher. In the second classroom, two teachers take turns 
instructing children throughout the day. One uses Spanish and the other uses 
English. At the end of the year, the children are tested in each language. The 
X̄ for the first class is 82.7; the X̄ for the second is 78.1 (and the distribution 
shows that X̄ and s.d. are appropriate measures of central tendency and 
variability or dispersion for these data). The question is whether the test performance 
of the children in the two classrooms differed significantly. By "eyeballing" the 
means, it appears that the first class "did better" than the second. 

As you might imagine, the process by which we test the difference between these 
two groups requires that we look not only at the difference between the two 
means but that we also place that difference in a sampling distribution of such 
differences. This procedure will differ from that described for a Case 1 t-test 
because instead of placing a mean in a distribution of X̄s, we want to look at a 
distribution of differences between X̄s. 

The basic problem is finding this sampling distribution. Once again, we will use 
sample statistics to estimate population parameters. Using our statistics, we will 
estimate the differences that we would get if we found and tested another two 
classes and compared their means, then found another two classes and tested 
them to compare their means, and so on until we felt we had tested the 
population. We would find the difference between each pair of classes on the test and 
plot these in a sampling distribution. Then we would place our difference in 
means in that distribution and decide whether the difference "belongs" or "is 
typical" of that distribution. If it is, we will not be able to reject the null 
hypothesis because the difference is normal (not extreme) for the distribution of 
differences. 


Sampling Distribution of Differences Between Means 

Whenever we want to compare the means of two different groups, we can 
visualize the procedure where we collect data on the test from two classes, another 
two classes, and another two until we have enough differences between means to 
represent the population. We compute the differences between the means for 
each set of two classes. Then we construct a frequency distribution of all these 
differences which will be called a sampling distribution of differences between 
means. The distribution, if it includes many, many differences between means, 
should have the following characteristics: 

1. The distribution is normally distributed. 

2. It has a mean of zero. 

3. It has a standard deviation called the standard error of differences between 
means. 

The distribution will be bell-shaped, and we will need a "ruler" to discover the 
place of the difference between our two means. We will use one which measures 
the standard error of difference between means. This should ring a bell. This is 
the third time around for finding the place of our data in a distribution in exactly 
the same way. 
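
A small simulation may help make these characteristics concrete. It is only a 
sketch: the population values (mean 11, standard deviation 4.4), the group size 
of 19, the number of simulated pairs, and the function name are all invented 
for the illustration. Both groups in each pair are drawn from the same 
population, as the null hypothesis assumes. 

```python
import math
import random
import statistics


def simulate_differences(pop_mean, pop_sd, n, num_pairs, seed=1):
    """Draw num_pairs pairs of groups from ONE population and return the
    mean and standard deviation of the differences between their means."""
    rng = random.Random(seed)
    differences = []
    for _ in range(num_pairs):
        first = statistics.mean([rng.gauss(pop_mean, pop_sd) for _ in range(n)])
        second = statistics.mean([rng.gauss(pop_mean, pop_sd) for _ in range(n)])
        differences.append(first - second)
    return statistics.mean(differences), statistics.stdev(differences)


mean_diff, sd_diff = simulate_differences(11, 4.4, 19, 2000)
# Predicted standard error of differences for two groups of equal size.
predicted = math.sqrt(4.4 ** 2 / 19 + 4.4 ** 2 / 19)
print(f"mean of differences = {mean_diff:.2f}")
print(f"SD of differences = {sd_diff:.2f} (predicted {predicted:.2f})")
```

The mean of the simulated differences should hover near zero, and their 
standard deviation is the "ruler" that the t formula uses in its denominator. 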

To review, to find an individual score in a normal distribution, we use a z-score 
formula: 

$$ z = \frac{\text{difference between score and mean}}{\text{standard deviation}} $$

To place a sample mean in a distribution of means, we used a t-test formula that 
said: 

$$ t = \frac{\text{difference between sample mean and population mean}}{\text{standard error of means}} $$

or 

$$ t_{obs} = \frac{\bar{X} - \mu}{s_{\bar{X}}} $$

Now, in comparing the difference between two means, the formula is: 

$$ t = \frac{\text{diff. between 2 means} - \text{diff. between 2 population means}}{\text{standard error of differences between means}} $$

Now, since we believe that the difference between the two population means will 
be zero (because they are from the same population), we can immediately simplify 
this formula by deleting the second part of the numerator. 

$$ t = \frac{\text{difference between 2 sample means}}{\text{standard error of diff. between means}} $$

$$ t = \frac{\bar{X}_e - \bar{X}_c}{s_{(\bar{X}_e - \bar{X}_c)}} $$

Let's apply this formula now to some data. Lazaraton (1985) planned to evaluate 
a set of authentic language materials used in a beginning-level ESL class. There 
were two sections of beginning ESL, one of which would form the experimental 
group and the other the control group. While students were placed in the class 
on the basis of a placement test, it was important to equate the listening skills of 
the two classes prior to giving the instruction. Here is a table showing the scores 
of the two groups: 

ESLPE Listening Scores & Total Scores 
Two Classes 

    Group      n     Mean    s      t value   df   p 

    LISTENING SCORE 
    Control    19    11.7    4.0 
    Exper.     19    10.5    4.7 

    TOTAL ESL SCORE 
    Control    19    66.9    7.7 
    Exper.     20    63.4    8.4 

Let's fill out the chart for this example. 

Research hypothesis? There is no effect of group on listening 
comprehension (i.e., there is no difference in the means of the 
experimental and control groups) 

Significance level? .05 
1- or 2-tailed? 2-tailed 
Design 
    Dependent variable(s)? Listening comprehension 
    Measurement? Scores (interval) 
    Independent variable(s)? Group 
    Measurement? Nominal (experimental vs. control) 
    Independent or repeated-measures? Independent 
    Other features? Intact groups 
Statistical procedure? t-test 


Box diagram: 

          EXPER     CONT 

    X̄   |        |        | 


In statistics books, you may find the null hypothesis stated in the following way 
for the t-test: 

H₀ = The two samples are from the same population; the difference 
between the two sample means which represent population means 
is zero (μ₁ − μ₂ = 0). 

The null hypothesis says that we expect that any difference found between the 
two sample groups falls well within the normal differences found for any two 
means in the population. To reject the null hypothesis, we must show that the 
difference falls in the extreme left or right tail of the distribution. 

Here is the formula: 

$$ t_{obs} = \frac{\bar{X}_e - \bar{X}_c}{s_{(\bar{X}_e - \bar{X}_c)}} $$

The denominator is the standard error of differences between means. The 
subscripts in the numerator just identify one mean as coming from the experimental 
group and the other from the control group. 

The numerator is the easy part. The difference between the two means (10.5 − 
11.7) is −1.2. The question is whether that difference is significant. To find out, 
we must place this difference in a sampling distribution and discover how far it 
is from the central point of that distribution. We measure the distance with a 
"ruler," in this case the standard error of differences between means. 

The denominator is the ruler. You remember from our discussions of Case 1 
studies that we will use sample statistics (in this case, the standard deviations 
from the mean of the two groups) to estimate the population parameter. We 
know that in making this estimate, we need to correct the "ruler" for the size of 
the classes (the ns of the two groups). 

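The formula for the "ruler" is: 

$$ s_{(\bar{X}_e - \bar{X}_c)} = \sqrt{\frac{s_e^2}{n_e} + \frac{s_c^2}{n_c}} $$

Since we have the s and the n for each group, we can fill in this information as 
follows: 

$$ s_{(\bar{X}_e - \bar{X}_c)} = \sqrt{\frac{4.7^2}{19} + \frac{4.0^2}{19}} $$

Use your calculator to compute the value for the standard error of differences 
between the means. It should be: 

$$ s_{(\bar{X}_e - \bar{X}_c)} = 1.42 $$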

Now that we have the values of both the numerator and the denominator, we can 
find the value of t observed. 

$$ t_{obs} = \frac{10.5 - 11.7}{1.42} = \frac{-1.2}{1.42} = -.845 $$


At this point, we will place the observed value of t into the sampling distribution 
of differences between means for the appropriate degrees of freedom. Since there 
were 19 Ss in each group for this test, each group has 18 df. We add these 
together to get the df for the study: 18 + 18 = 36 df. Another way of saying the 
same thing is df = n₁ + n₂ − 2. We turn to the t-test table in the Appendix and 
look at the intersection of 36 df and .05. It's not there, so we choose the 30 df row 
instead (to be more conservative). The t critical needed for rejection of the null 
hypothesis is 2.042. We cannot reject the null hypothesis. This information may 
be presented in the simple statement: t = .845, df = 36, p = n.s. The 
abbreviation n.s. stands for a "non-significant difference." If there is sufficient space, a 
table such as the following may be used to give the same information. 
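
The whole Case 2 calculation can be sketched in Python as a check on the 
hand computation. The function names are invented for the illustration; the 
means, standard deviations, and group sizes are the listening scores from the 
data table. Because the code does not round the standard error to 1.42 before 
dividing, its t value differs from the hand-computed .845 in the third decimal 
place. 

```python
import math


def standard_error_of_difference(s1, n1, s2, n2):
    """The 'ruler': square root of (s1 squared / n1 + s2 squared / n2)."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def case2_t(mean1, s1, n1, mean2, s2, n2):
    """Return observed t, degrees of freedom, and the standard error
    for a comparison of two independent group means."""
    se = standard_error_of_difference(s1, n1, s2, n2)
    t_obs = (mean1 - mean2) / se
    df = n1 + n2 - 2
    return t_obs, df, se


# Experimental group first, control group second.
t_obs, df, se = case2_t(10.5, 4.7, 19, 11.7, 4.0, 19)
print(f"standard error = {se:.2f}, t = {t_obs:.3f}, df = {df}")

# 2.042 is the two-tailed .05 critical value for the 30 df row.
print("reject H0" if abs(t_obs) >= 2.042 else "cannot reject H0")
```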


ESLPE Listening Scores 
Two Classes 

    Group      Mean    s      t value 

    Control    11.7    4.0    .845 
    Exper.     10.5    4.7 

The conclusion we can draw about the differences between these two groups 
before the treatment began is that their listening scores did not differ; the groups 
are of approximately the same listening ability. (We cannot say, however, that 
they are significantly similar since we have only tested the null hypothesis of no 
difference.) 


ooooooooooooooooooooooooooooooooooooo 

Practice 9.2 

► 1. Perform a t-test to test the difference between the two groups on the total 
ESL test scores given in the data table on page 260. Show your calculations. 


Can you reject the null hypothesis? Why (not)? 


What can you conclude? 


ooooooooooooooooooooooooooooooooooooo 

Assumptions Underlying t-tests 

Before applying any statistical test it is important to check to make sure the 
assumptions of the test have been met. Because the t-test is meant to compare two 
means, it is very widely used. Unfortunately, it is also the case that it is very 
widely misused. The following assumptions must be met. 

Assumption 1: There are only two levels (groups) of one independent variable to 
compare. For example, the independent variable "native language" can be 
defined as Indo-European and non-Indo-European. There are only two levels of the 
variable. However, if we defined "native language" by typology as 
subject-prominent, topic-prominent, and mixed (shows features of both), there are now 
three levels of the variable. Only two levels can be compared in the t-test 
procedure. You cannot cross-compare groups. This means you cannot compare group 
1 and 2, 1 and 3, and then 2 and 3, etc. If you try to use the t-test for such 
comparisons, you make it very easy to reject the null hypothesis. (For example, 
if you set the significance level at .05 and then run four comparisons, you can 
check the level by the formula α = 1 − (1 − α)^c, where the c refers to the number 
of comparisons. So for four comparisons, the actual level would be 
α = 1 − (1 − .05)^4 = 1 − (.95)^4 = 1 − .815 = .185. So the probability level has 
changed from .05 to about .185.) In at least one of the comparisons you will not know 
which significant differences are spurious and which are not. Thus, interpretation 
becomes impossible. 

Assumption 2: Each S (or observation) is assigned to one and only one group. 
That is, the procedure is not appropriate for repeated-measures designs. There 
is another procedure, the matched-pairs t-test, for such designs. 
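
The inflation of the significance level described under Assumption 1 is easy 
to tabulate. The short sketch below applies the formula 1 − (1 − α)^c for one to 
four comparisons; the function name is made up for the illustration. 

```python
def familywise_alpha(alpha, comparisons):
    """Actual probability level after running several comparisons,
    each tested at the nominal alpha: 1 - (1 - alpha) ** c."""
    return 1 - (1 - alpha) ** comparisons


for c in (1, 2, 3, 4):
    print(f"{c} comparison(s): actual level = {familywise_alpha(0.05, c):.3f}")
```

Even three comparisons, the number needed to cross-compare three groups, push 
the actual level well above the .05 that was chosen. 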

Assumption 3: The data are truly continuous (interval or strongly continuous 
ordinal scores). This means you cannot do a t-test on raw frequencies. You may 
be able to convert frequencies to continuous data by changing them to rates or 
proportions but that conversion must be justified. In addition, you must be able 
to show that the converted data approaches interval measurement. If you have 
ordinal data or converted frequency data, you may question whether the scale is 
continuous and/or whether the data are distributed normally across the points 
of the scale (a common problem with 5-point or 7-point scales). In such cases, 
the median rather than the mean may be the better estimate of central tendency 
and you would be much wiser to use a nonparametric test (e.g., Median or Rank 
Sums). 

Assumption 4: The mean and standard deviation are the most appropriate 
measures to describe the data. If the distribution is skewed, the median is a more 
appropriate measure of central tendency. Use a nonparametric procedure 
(Median or Rank Sums) for the comparison. 

Assumption 5: The distribution in the respective populations from which the 
samples were drawn is normal, and variances are equivalent. It is, indeed, difficult to 
know if the distribution in the population is or is not normal. This is a special 
problem when Ss are not randomly selected but come from intact groups (i.e., our 
ESL classes). For example, it would be difficult to know if population 
performance on a listening comprehension test of groups of ESL students is or is not 
normal. If the test used by Lazaraton in the previous example (page 260) were 
administered to successive groups of Ss (16 Ss in each group), we could begin to 
build a sampling distribution of means. It's possible that the result would be 
normal but it's also possible it might not be. For example, it's possible a bimodal 
distribution would occur if, for example, groups of immigrant students clustered 
around the top end of the distribution and groups of foreign students around the 
bottom. You might not be at all interested in the immigrant/foreign student 
dichotomy but, because the data come from intact groups which show these 
characteristics (rather than from random samples), the data might not be normally 
distributed. As for the equal variance assumption, statisticians state that, if the 
n size of the two groups is equal, the t-test is "robust" to violations of this 
assumption. Magically, t-test procedures in some computer packages (e.g., SAS 
and SPSS-X statistical packages) give information that allows you to judge 
normality of distribution and equivalence of variance. 

It's extremely important to observe the assumptions underlying the t-test 
procedure. Failure to do so could lead you to claims that are not warranted. Even 
though the claims might have been correct if another statistical procedure were 
used, the misapplication of any statistical test may cause others to lose confidence 
in your work. Unfortunately, the computer will never tell you if you have 
violated the assumption of any test you ask it to run. You must check this yourself. 

You might wonder why we spend all this time and effort. Why can't we just 
"eyeball" the two means and tell if they are truly different? It is possible that the 
means of two groups look different and yet they might not be. The diagram 
below helps explain why this is so. 



While the X̄s appear far apart, their distributions overlap. Individual scores in 
one distribution could easily fit into those of the other distribution. They almost 
form one single distribution curve. 

The next diagram shows why two means that look quite similar might, in fact, 
be different. 



While the X̄s appear close together, their distributions overlap only slightly. The 
scores in one distribution do not appear to fit into the other. 

Think back to your explanation of why the critical values required for an .05 level 
of significance become higher as the n in the groups becomes smaller. Remember, 
too, that the "ruler" by which we measure the placement of the difference relative 
to the normal distribution also changes with group size. Can you weave this into 
your understanding of why a statistical test is necessary if we are to know 
whether differences are or are not significant? (You might want to try reworking 
one of the problems for a Case 2 t-test and increase or decrease the n size for each 
group. What happens?) 

(You can look at two means and say that they are numerically the same or differ¬ 
ent. However, there is no way to look at two means and conclude they are sta¬ 
tistically the same or different without using a /-test. (If you understand how a 
/-test is done and have additional information on n size and s , you can, of course, 
make a very good guess.) 
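
The point can be made concrete with a short sketch. It holds the difference
between two means constant and varies only the n size and s. The summary values
are hypothetical (they are not from any example in this chapter), and the function
assumes the pooled-variance form of the standard error of the difference, which
may not match the exact Case 2 formula used earlier in the chapter.

```python
import math

def independent_t(mean1, mean2, s1, s2, n1, n2):
    """t for two independent groups (pooled-variance standard error; an assumption)."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff

# Hypothetical values: the means never change, only n per group and s do.
for n, s in [(5, 10.0), (30, 10.0), (30, 4.0)]:
    t = independent_t(75.0, 70.0, s, s, n, n)
    print(f"n per group = {n:2d}, s = {s:4.1f}, t = {t:.2f}")
```

The same 5-point difference yields a larger t value as the groups grow or as the
spread of scores shrinks, which is why the means alone cannot settle the question.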


Strength of Association: eta²

When we use a t-test to compare two groups, we obtain a figure which does or
does not allow us to reject our H0. By rejecting the H0, we can show that there
is an effect of the levels of the independent variable on the dependent variable.
The two levels differ in their performance on the dependent variable.

That may be all that you wish to show. If you have compared, for example,
females vs. males (independent variable sex) on a language aptitude test and
found that there is a statistical difference, that may suffice. However, in the spirit
of research as a process of enquiry, you might wonder just how much of the
difference in the performance of the two groups is really related to male vs. female
differences. You reason that there is a natural distribution or spread of scores in
the data of each group which may be related to being male or female but that
there are probably other factors (i.e., "error" as far as your research is concerned)
that also contributed to pushing apart the means of the two groups.

When the sample statistic is significant, one rough way of determining how much
of the overall variability in the data can be accounted for by the independent
variable is to determine its strength of association. For the t-test and matched
t-test this measure of strength of association is called eta squared (η²). It's very
simple to do. The formula uses values that can be found in your t-test
calculations or your computer output.

$$\eta^2 = \frac{t^2}{t^2 + df}$$


Imagine that we had carried out the above study, comparing the performance of
males and females on a language aptitude test and that we obtained a t value of
4.65. There were 20 Ss in our study, 10 females and 10 males. The df for the
study are, therefore, (10 - 1) + (10 - 1) or 18 df. Let's determine the strength
of association between sex and performance.

$$\eta^2 = \frac{4.65^2}{4.65^2 + 18} = .55$$

The η² of .55 in the above example is a very strong association. It tells us that
55% of the variability in this sample can be accounted for by sex. (45% of the
variability cannot be accounted for by the independent variable. This variance
is yet to be explained.)
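
As a check on the arithmetic, the calculation can be sketched in a few lines. The
function name is ours, chosen for illustration; the values are the t of 4.65 and
the 18 df from the example above.

```python
def eta_squared_from_t(t, df):
    """Strength of association for a t-test: t squared over (t squared + df)."""
    return t**2 / (t**2 + df)

eta2 = eta_squared_from_t(4.65, 18)
print(f"eta squared = {eta2:.2f}")            # share of variability accounted for
print(f"unexplained = {1 - eta2:.2f}")        # share left to "error"
```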


One reason eta² is such a nice measure to have is that it lets you think more about
findings and how you might rework the study. For example, if you obtained a
significant effect (i.e., could reject the H0 on the basis of the t-test) yet found that
the strength of association did not allow you to account for more than say 5 or
10% of the variability in performance, you might think about what other special
features the sample showed. You might find that some of the Ss seemed to have
high test anxiety and others not. If you could remove this factor, perhaps sex
would turn out to be a stronger predictor of language aptitude. Or perhaps, you
reason, sex is not an overriding factor in language aptitude and you want to add
more independent variables to the study and try again. As you can see, if your
interest is in understanding language aptitude rather than trying to show sex
differences, the strength of association test will help start you on your way to
improving the research.

Strength of association measures are sometimes reported in correlational research
in our field. They are (to our knowledge) rarely included in reports in our field
that give the results of the parametric and nonparametric statistical procedures
discussed in this chapter. They are, however, extremely useful measures that
should help you interpret your results. For example, if you ran a t-test procedure,
found a statistical difference and claimed that group A did better than group B
on the basis of the statistical procedure, you and your readers still do not know
exactly how important that finding is. This is especially true when you have large
numbers of Ss in the groups. (Remember that when you have more Ss, the
degrees of freedom go up and the t critical value needed to reject the H0 gets
smaller. Thus, it becomes easier to reject the H0.) If you do a strength of
association test, then you will know how important the variable is. If it turns out that
you can show an association like 50% or better, you can make a big deal about
your findings. If it turns out that it's, say, less than 10%, you might want to be
more conservative and say that there are, of course, other factors that need to be
highlighted in future research (research you have in mind to better answer your
broad research question).

ooooooooooooooooooooooooooooooooooooo 

Practice 9.3 

Do a strength of association test on the outcomes of the t-test problems in this
chapter. Give the results and the interpretation in the space below.

► 1. German example (page 254)

_______________

_______________

► 2. Keillor example (page 256)

_______________

_______________


ooooooooooooooooooooooooooooooooooooo

Let's review once again some of the caveats regarding the use of the t-test
procedure.

1. The data that represent the dependent variable should be measured as
interval scores. If used with ordinal scaled (rank-ordered) data, it is assumed the
scales approach interval measurement (so that X̄ and s.d. are appropriate
descriptive statistics).

2. Each S (or observation) is assigned to one and only one group if you wish to
use the regular t-test formula. If the observations are from the same or
matched groups, use the matched t-test formula.

3. The scores are assumed to be normally distributed so that the X̄ and s.d. are
appropriate measures of central tendency and variability.

4. The scores in the populations from which each sample was drawn are
assumed to be normally distributed and the variances are assumed to be
equivalent.

5. Multiple comparisons cannot be done using the t-test procedure. (You cannot
compare mean 1 with mean 2, mean 2 with mean 3, mean 1 with mean 3, and
so forth.)

Even when we observe all these warnings regarding the use of the t-test, we may
run into trouble in interpreting the results. First, the results of the t-test
procedure tell us whether a difference exists. When we reject the null hypothesis, we
feel confident in claiming that a difference exists. In many cases, however, we
cannot feel confident in saying what caused the difference. That confidence can
only come from a well-designed study where all the threats to internal validity
have been carefully controlled.

When we do evaluation research, research that investigates effectiveness of
treatments, we are advised to use random selection of Ss and random assignment of
Ss to treatments. If we are careful in this process, we can assume that Ss are
equally representative of the population and that they are equally represented in
the two groups being compared. We believe the two groups are truly the same
(except for the treatment). However, we often use intact groups of Ss, the
students in our classes. If these Ss are not randomly selected and randomly assigned
to the two groups, we must be extremely conservative in interpreting t-test results.
We cannot generalize from the study although we can use the test for descriptive
purposes if the data are normally distributed. Even so, any difference we
discover between the groups (no matter how large or small the t value may be) is still
suspect from a design standpoint. That is, we can be confident there is a
difference between the groups but we cannot be confident that there wasn't a
difference to begin with (there was no pretest in the design).

To get around this problem, we can use a pretest. If we use gain scores to try to
avoid the problem of preexisting differences between the two groups, we still need
to be careful and take into account the nature of the groups. For example, all
teachers marvel at the progress low-level students can make in a school term and
despair at the slow progress made by students at the upper-intermediate level.
If you have taught a beginning-level language class and an advanced class, you
know that students at the beginning level make the greatest observable gains. If
we use gain scores, the lower group will make the greater gain. (Smart business
people are happy to promise money-back guarantees for beginners, but few will
offer such guarantees for advanced Ss!) When the treatment groups are not
equivalent, the t-test procedure should not be used (there are other procedures
available and we will discuss them later).

In non-evaluative research, research where two groups are compared, the same
problem obtains. We have already noticed this in the German example (page
254). The data there are from an intact class. Students were not randomly
selected or randomly assigned to the class. We may use the t-test for descriptive
purposes, but we cannot generalize the findings to other classes.

Finally, there are two related problems in interpreting results of t-tests in applied
linguistics research which may not be so obvious in other fields. The first has to
do with the assumptions of normal distribution and equal variances in the
population. In the Keillor example on reading scores (page 256), the sample data came
from 14 students. In running the analysis, you envision collecting scores from
many different classes of 14 students. Is there reason to believe that as you
continue to collect the data, compute the X̄s and place them in the sampling
distribution of means, that the distribution will be normal? That depends. Each year,
the Los Angeles Unified School District publishes results of reading test scores
by grade level. The distribution at each grade level is clearly bimodal! Some
schools from certain school districts form a normal distribution curve near the top
of the range for the means. Schools from other districts form a normal
distribution curve near the bottom of the range. However, the overall curve is
bimodal. Since information on the distribution of means in a population is not
usually available (we estimate it from the sample statistics), we need to be both
careful and realistic in deciding whether or not this particular assumption of the
t-test has been met.

The second is more clearly an interpretation issue. We often compare the
performance of a group of learners with that of a group of native speakers
because we truly do not know whether they will perform in similar ways. For
example, we may not know whether learners will structure narratives using the
same story components used by native speakers. We may not know whether
learners and native speakers use the same amount of pause time between turns
at talk, or whether they take the same amount of time before responding to
questions. However, we can be sure that learners and native speakers will
perform differently on a language achievement test. It makes sense, then, to assume
that when we compare native speakers and learners on some segment of
language, say on grammaticality judgments of relative clauses, we will find
highly significant differences. The differences, however, may have less to do with
types of relative clauses than with general language proficiency. Large differences
can be expected in such research but the differences may be due as much to
language proficiency as to the variable being tested. In any case, no causal claims
can be made because it is not possible to randomly assign Ss to native or
nonnative groups. Common sense, then, should guide our interpretation of the t-test
results. At the very least, we should temper our claims, pointing out this problem
in interpreting the results.

t-test procedures are extremely useful and powerful statistical tests for comparing
two means (assuming X̄ is the best measure of central tendency for the data).
They can be used when the means are independent (i.e., between-groups) and
when they are paired (i.e., within pairs). As with all tests, care must be taken to
observe the requirements for the test. Prior to applying the procedure, be sure to
check each of these. And, after applying the procedure, interpret the findings
with care and common sense.


Nonparametric Comparisons of Two Groups

There are times when data from two groups are to be compared, but the
assumptions of the t-test cannot be met. Since the t-test compares X̄s, it is
important to make sure that the X̄ and s.d. are the most appropriate measures to
describe the distribution of the data in the two samples. You will remember that
the larger the sample size, the more likely it is that a normal distribution will be
obtained. The t-test is especially designed for small sample sizes, so it is safe to
go below a sample size of 30. However, it is important to check the normality
of the distribution. If there are "outliers" in the sample, the data are not normally
distributed and the median rather than the mean is the best measure of central
tendency.

Another assumption that relates to normal distribution is that the data are
interval scored. Nonetheless, you will often find that the t-test is used for ordinal
data and, worse yet, with frequency data. The rationale for using the procedure
with rank-order scales is that the data are linear, that there is an underlying
linearity in all continuous data which can approach interval measurement.
Statisticians differ in the advice they give novice researchers in this regard. If the
measure used is truly continuous (the distance between a 1 and a 2 is
approximately the same as the distance between a 3 and a 4) and the data are normally
distributed throughout the scale so that the X̄ and s are the best measures for the
data, then a parametric procedure is quite appropriate. Nonparametric tests also
assume that there is some underlying continuity to scales and normality in the
distribution, but these assumptions are much weaker. When you are not sure, it
is best to select a nonparametric procedure.

The third assumption is that of a normal distribution in the population. The
t-test requires that this be the case. If you have strong doubts, use nonparametric
procedures (procedures which do not ask you to make estimates about the
population based on knowledge of the normal distribution).


Selecting the Appropriate Nonparametric Test 

To select the appropriate procedure, the first question is whether the comparison
is between two independent groups or a comparison of the same Ss at two
different times. In this section, we present two nonparametric tests for comparisons
between two groups: the Median test and the Rank sums test (also known as the
Wilcoxon Rank sums test, or Mann Whitney U). The comparisons for
repeated-measures designs are given in chapter 10.


Median Test 

The Median test is a very simple procedure for determining differences between
data of two groups. If the data of the two groups have the same or similar
medians, we would expect that half of the Ss (or observations) in each of the groups
would fall above and below the median of the combined groups. That is, we
expect that if we find the median for all the Ss or all the observations, there would
be as many from group 1 as from group 2 above the median and as many from
each group below the median.

Once we have computed the median, we can set up a contingency table for the
data like this:

Sample Contingency Table

Position to Median    Gp. 1          Gp. 2          Total

Above Med.            A              B              A + B

Below Med.            C              D              C + D

Total                 A + C = n1     B + D = n2     N = n1 + n2

If the null hypothesis is true, then half the observations in the first group should
be in A and half in C; half those in the second group should be in B and half in
D. If the actual frequencies are quite different from this expectation, then we can
reject the null hypothesis.

The formula for the Median test is:

$$T = \frac{(A \div n_1) - (B \div n_2)}{\sqrt{p(1 - p)(1 \div n_1 + 1 \div n_2)}}$$

where $p = (A + B) \div N$

Let's see how the formula works. First of all, it uses the values of A and B, the
number of Ss or observations in each group that are above the median. However,
it also takes into account the total number of Ss or observations in each group.
The top part of the formula, as always, is the easy part. It checks to see if,
indeed, half the observations (or Ss) in each group are above the median.

We do expect the groups will be different and the numerator will have a value
other than 0. How large does that value need to be before we can say the
difference is an important one? It seems that we should just be able to divide by the
square root of the number of scores above the median divided by N. That's what
the classy symbol p represents. However, the denominator is adjusted to take
into account the number of Ss (or observations) in each group as well.

Here is a data set that can be used for this analysis. 





Foreign Ss 


Immigrant Ss 

25 

13 

9 

46 

31 

43 

25 

30 

17 

20 

21 

42 

17 

20 

37 

25 

38 

30 

26 

23 

20 

17 

19 

20 

18 

26 

11 

36 

38 

29 

30 

12 

32 

54 

41 

13 

24 

20 

16 

8 

68 

32 

21 

37 

31 

26 

28 

30 


The above data are scores achieved by a group of foreign students and a group
of immigrant students on a three-part test. The total scores (out of a possible 70
points) are given above. Note that there are some extreme scores (the X̄ is not
an appropriate measure of central tendency).

The first task, then, is to count the number of observations above and below the
median for each group and place that information into a contingency table.

            Foreign    Immigrant    Total
Above          12          12         24
Below          20           4         24
Total          32          16         48

Now we can compute the value of p,

$$p = (12 + 12) \div 48 = .50$$

and substitute all this information into the formula:

$$T = \frac{(12 \div 32) - (12 \div 16)}{\sqrt{(.5)(1 - .5)(1 \div 32 + 1 \div 16)}}$$

T = -2.45
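
The substitution above can be checked with a minimal sketch of the Median test
formula. The function and argument names are ours, introduced only for
illustration; the counts are the ones in the contingency table.

```python
import math

def median_test_T(above1, n1, above2, n2):
    """Median test statistic from counts above the combined median.

    above1, above2: number of observations above the median in each group
    n1, n2:         total number of observations in each group
    """
    p = (above1 + above2) / (n1 + n2)           # proportion above the median overall
    numerator = above1 / n1 - above2 / n2       # difference in group proportions
    denominator = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return numerator / denominator

# Foreign group: 12 of 32 above; immigrant group: 12 of 16 above.
print(f"T = {median_test_T(12, 32, 12, 16):.2f}")
```

Note that the formula is undefined when every score (or no score) falls above the
median, because p(1 - p) is then 0.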

Next we need to be able to place the T value in an appropriate distribution to
determine whether or not we can reject the null hypothesis. Fortunately for us,
the T value corresponds to the z values presented in the z table in the appendix.
We already know how to read this table. However, we don't really need to refer
to the table anymore because we know that we need a z score of 1.96 or better to
reject the null hypothesis at the .05 level. If we set the α at .01, we need to meet
or exceed a z value of 2.58 for a two-tailed test (2.33 for a one-tailed test).

Assume we set the α at .05. The critical value of z for .05 is ±1.96. Our obtained
value of -2.45 exceeds it in absolute size. We can, therefore, reject the null
hypothesis and conclude that for these data there is a
relationship between groups and performance on this test. There is a significant
difference between the two groups (the immigrant group outperformed the
foreign student group).
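
For readers who prefer to compute rather than look up the z table, here is a small
sketch using Python's standard library. The helper names are ours; the only input
taken from the text is the obtained T of -2.45, which is treated as a z value.

```python
from statistics import NormalDist

def critical_z(alpha, tails=2):
    """z value that must be met or exceeded (in absolute size) at a given alpha."""
    return NormalDist().inv_cdf(1 - alpha / tails)

def two_tailed_p(z):
    """Probability of a z at least this extreme in either direction under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(f"critical z, alpha .05, two-tailed: {critical_z(0.05):.2f}")
print(f"critical z, alpha .01, two-tailed: {critical_z(0.01):.2f}")
print(f"critical z, alpha .01, one-tailed: {critical_z(0.01, tails=1):.2f}")
print(f"two-tailed p for T = -2.45:        {two_tailed_p(-2.45):.4f}")
```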

You might have wondered how to classify a score which is at the median. In the
data for this example, we managed to ignore this problem. If you have a very
large data set, you can safely discard scores at the median. However, in applied
linguistics studies we seldom have such large numbers of subjects or observations
that this is a safe procedure. A second solution is to compare all scores above the
median with those not above the median. In this case, the scores on the median
are combined with those below the median. A safer solution would be to run the
procedure twice, once grouping these cases with those above the median and once
grouping them with those below the median. Hopefully, the results will be so
similar that you will find it makes no real difference in the results. If there is a
large difference, see your statistical consultant.
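
The "run it twice" suggestion can be sketched as follows. The two score lists are
invented for illustration (they are not data from this chapter), and the function
name and its ties_count_as_above switch are our own.

```python
import math
from statistics import median

def median_test_from_scores(group1, group2, ties_count_as_above):
    """Median test T, with scores exactly at the median grouped above or below."""
    med = median(group1 + group2)

    def n_above(scores):
        if ties_count_as_above:
            return sum(1 for x in scores if x >= med)
        return sum(1 for x in scores if x > med)

    a, b = n_above(group1), n_above(group2)
    n1, n2 = len(group1), len(group2)
    p = (a + b) / (n1 + n2)
    return (a / n1 - b / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Hypothetical scores; the combined median (7) is itself a score in both groups.
group1 = [3, 5, 5, 6, 7, 8]
group2 = [5, 7, 8, 9, 9, 10, 11]

t_above = median_test_from_scores(group1, group2, ties_count_as_above=True)
t_below = median_test_from_scores(group1, group2, ties_count_as_above=False)
print(f"ties grouped above: T = {t_above:.2f}")
print(f"ties grouped below: T = {t_below:.2f}")
```

With small groups like these, the two runs can land on opposite sides of the
critical value, which is exactly the situation in which a second opinion is wise.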

As you have seen, we have lost information in the process of applying the Median
test to the data. The actual value of each score has played a role in determining
the median. Once that is done, each score is either above or below the median.
We do not care how far above or below the median each observation is from this
point onward. This differs from the process used in a t-test where we used our
"ruler" to check for distance from the mean.


ooooooooooooooooooooooooooooooooooooo

Practice 9.4

1. Look back at the data examples in the section on the t-test procedure. Which
do you feel may have violated the assumption of normal distribution (i.e., X̄ and
s.d. are not appropriate measures of central tendency and dispersion)?


2. We would prefer not to lose any information that exists in our data. Yet, at
the same time, we must observe the assumptions underlying any statistical
procedure. If this were your data, which procedure, Median test or t-test, would
you select? Why?


► 3. Prior to instruction, Ss wishing to enroll in an oral skills class were
interviewed. The Ss received ratings on three 5-point scales: pronunciation, fluency,
and grammar. On the basis of their scores, Ss were assigned to an intermediate
or an advanced oral skills class. Here are the data, the sum of the three 5-point
scales for each S:

Assignment


Interm

Adv

12 

15 

10 

10 

10 

15 

11 

15 

8 

15 

7 

12 

15 

10 

14 

12 

9 

12 

10 

14 

10 

9 


12 


Median = 12 

Do a Median test. Show your calculations in the space provided below: 


Can you reject the null hypothesis? Or is there really no meaningful difference
in the ratings of the two groups of Ss?

-- 

-- 

ooooooooooooooooooooooooooooooooooooo 

Rank Sums Tests (Wilcoxon Rank Sums, Mann Whitney U)

The Wilcoxon Rank sums test and the Mann Whitney U are actually the same
test. The test compares two groups on the basis of their ranks above and below
the median. The Rank sums test is often used when ordinal rating scales rather
than interval type scores are to be used. The researcher may not be confident
that the scales are strongly interval or that X̄ is the best measure of central
tendency, but is certain that each S (or observation) can be ranked in respect to
other Ss (or observations). In such cases, the Rank sums test, rather than the
t-test, should be selected to compare the two groups.

Imagine that you have developed a set of materials to teach a unit on description
(one of the course objectives for your low-intermediate ESL composition class).
Your class has 21 students and you have randomly assigned 11 Ss to use the new
materials and 10 Ss to use the old materials for this unit. While you realize that
you cannot generalize the results (due to sampling problems), you still wish to
know whether the Ss using your new materials outperform those in the control
group when asked to write descriptive compositions. A panel of judges (who are
unaware of the treatment) assign scores to each S's descriptive composition. The
judges were guided in their scoring procedure by a checklist, but (even after being
trained to use this checklist) they seemed unhappy with their ability to score each
person accurately. However, they agree that their combined scores do allow you
to rank-order the Ss. The best S is ranked 1, second best 2, and so on. Students
with the same score are given a tie rank. The question is whether Ss in the
experimental group will place closer to the top than those in the control group. The
design follows.


Research hypothesis? There is no difference in the ranks assigned
  to the compositions of the two groups.
Significance level? .05
1- or 2-tailed? Always 2-tailed for this procedure
Design
  Dependent variable(s)? (Descriptive) composition rating
  Measurement? Rank-order scale
  Independent variable(s)? Group
  Measurement? Nominal (experimental vs. control)
  Independent or repeated-measures? Independent
  Other features? Subdivided intact class
Statistical procedure? Rank sums test


Box diagram:

EXPER    CONTROL


The data are arranged below. Remember that a low number equals a high rank.
Ties are assigned to the same rank. However, be careful. A tie for third place
gives each person a rank of 3.5, and the next person receives a rank of 5.

Rank     Group    Exper.    Control
 1        NM        1
 2        NM        2
 3.5      NM        3.5
 3.5      NM        3.5
 5        OM                   5
 6        NM        6
 7        NM        7
 8        NM        8
 9        OM                   9
10        NM       10
11.5      NM       11.5
11.5      NM       11.5
13        OM                  13
14        OM                  14
15        OM                  15
16        OM                  16
17        OM                  17
18        NM       18
19        OM                  19
20        OM                  20
21        OM                  21

                 T1 = 82    T2 = 149

(NM = new materials, the experimental group; OM = old materials, the control group)


From the rankings, it certainly looks as though the group with the new materials
outperformed the control group since more new-material Ss ranked high. To find
out how much confidence we can have in claiming a true difference, we will apply
a special z formula for rank sums. The test, as you can see from the way the data
are set up, compares the two groups but also considers the spread of the ranks
within each group.

Here is the formula:

$$z = \frac{2T_1 - n_1(N + 1)}{\sqrt{\dfrac{(n_1)(n_2)(N + 1)}{3}}}$$

The numbers 3 and 2 in the formula are constants. That is, they do not come
from the data. The formula asks how likely the total ranks for the Ss in each
group might be (given the number in each group and the total N for the study).
The obtained z value is then placed in the z distribution to tell us whether or not
we can reject the null hypothesis.

Let's carry out the calculations by placing the values in the formula. The n for
the experimental group is 11. That of the control group is 10. The total N for
the study is 21. The total for the ranks given the experimental group (T1) is 82.
(The total for the ranks of the control group (T2) is 149, but this information is
not needed in the formula.)

Substituting the values for these data into the formula gives:

$$z = \frac{2T_1 - n_1(N + 1)}{\sqrt{\dfrac{(n_1)(n_2)(N + 1)}{3}}} = \frac{2(82) - 11(22)}{\sqrt{\dfrac{11 \times 10 \times 22}{3}}} = \frac{-78}{28.40} = -2.75$$
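
The same substitution can be sketched in code. The function name is ours; T1, n1,
and n2 are the rank total and group sizes from the composition example.

```python
import math

def rank_sums_z(T1, n1, n2):
    """z for the Rank sums test, given the rank total of group 1 and both n sizes."""
    N = n1 + n2
    numerator = 2 * T1 - n1 * (N + 1)
    denominator = math.sqrt(n1 * n2 * (N + 1) / 3)   # 2 and 3 are constants
    return numerator / denominator

# Experimental group: T1 = 82, n1 = 11; control group: n2 = 10.
print(f"z = {rank_sums_z(82, 11, 10):.2f}")
```

Using the control group's total (T2 = 149, with the n sizes swapped) gives the same
value with the opposite sign, which is why only one rank total is needed.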




Now we can turn to table 1 in appendix C for the distribution of z scores. You
read the table in exactly the same way that you did before. However, by now,
you may remember the value you need to keep in your head: the z critical for α
of .05 is 1.96. Knowing this, we can reject the null hypothesis.

Notice that the obtained z score is a negative value. This is because the highest
ranking is 1, the lowest number, and not because the experimental group did
worse than the control! (Many a researcher has come close to shock having
forgotten for an instant that the highest rank is the lowest number. For this reason
researchers who carry out this procedure by hand sometimes reverse the order of
the numbers assigned to ranks before carrying out the procedure.) We can reject
the null hypothesis and should have confidence in the conclusion that the Ss in
the class who used the experimental materials outperformed those who used the
old materials. However, we cannot claim that these same results might obtain for
other learners in other classrooms since the design (particularly the use of intact
classes) does not allow us to generalize. The procedure is used here for
descriptive, rather than inferential, purposes.


ooooooooooooooooooooooooooooooooooooo

Practice 9.5

► 1. When you carry out any statistical test by hand that works with ranks, it's
important to be accurate in assigning ranks to the data. If two scores that would
be ranks 2 and 3 are tied, they are each given the rank of 2.5. The next available
rank is 4. The following table may help you in assigning ranks:

Assigning Ranks to Tied Scores

Scores    Ranks
   2        1
   6        2
   8        3
  10        4.5     average ranks 4 and 5 = 4.5 for each
  10        4.5
  11        6
  12        8       average ranks 7, 8, and 9 = 8 for each
  12        8
  12        8
  16       10
  22       11       CHECK: The final rank corresponds to N


CHECK: Count the total number of scores (N). Be sure that the highest rank
equals N (unless there is a tie for the highest rank). If it doesn't, there is an error
(most likely in the assignment of tied ranks).
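
A small sketch of the averaging rule may help when checking hand-assigned ranks.
The function name is ours. Following the table above, the lowest score receives
rank 1 here; reverse the sort if your design ranks the best score first.

```python
def assign_ranks(scores):
    """Return a rank for each score, averaging the ranks of tied scores."""
    ordered = sorted(scores)
    rank_for_value = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Tied scores occupy positions i+1 through j; give each the average.
        rank_for_value[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [rank_for_value[s] for s in scores]

scores = [2, 6, 8, 10, 10, 11, 12, 12, 12, 16, 22]
ranks = assign_ranks(scores)
for s, r in zip(scores, ranks):
    print(f"{s:3d}  {r:g}")

# CHECK: the highest rank should equal N unless the top scores are tied.
print("N =", len(scores), " top rank =", max(ranks))
```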


Assign ranks to the following data: 


Score Rank 
12 
13 

13 

14 

15 

16 


17 

17 

18 
18 

19 

20 

Check: N = _ Top rank = _

► 2. In an ESL reading class, Ss were given a cloze test (where they must fill in
every nth word in a reading passage). The students are from two different
first-language groups and so you wonder whether there is a difference in the cloze
scores of the two groups. Here are the data:


Gp. 2 

Gp. 1 

18 

22 

13 

20 

16 

21 

15 

19 

14 

16 

21 

24 

20 

23 


13 

Calculate the z value.

$$z = \frac{2T_1 - n_1(N + 1)}{\sqrt{\dfrac{(n_1)(n_2)(N + 1)}{3}}}$$


Can you reject the null hypothesis?


What conclusion can you draw?


3. The test gives us confidence in interpreting the descriptive statistics. If we
wanted to replicate the study so that the findings could be generalized (using
parametric inferential statistics), how might we redesign the study?


4. Look back through the examples given for the t-test procedure. Would you
recommend the use of the Rank sums test for any of the examples? If so, why?


5. Is the Rank sums test one that you might use for any of the research questions
you have defined for your own work? If so, why? What other possible tests
might you use?


ooooooooooooooooooooooooooooooooooooo

Strength of Association: eta²

There is also an η² formula for the Rank sums test. It can be used with any
statistical test that yields a z score. Since the T in the Median test equals z, this
formula is appropriate for it as well.

$$\eta^2 = \frac{z^2}{N - 1}$$


Again, all the information that you need to do a strength of association test is 
available in your printout or from your calculations. Assume that we had 18 5s 
classified as exceptional language learners and 12 classified as poor language 
learners. Perhaps you asked teachers to nominate the best learners they have ever 
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Ji.kI mnl the learners who had the* most difficulty. Ideally, this list of nominations 
would form a pool from which you might draw a random sample of 30 each. 
However, let 's assume you couldn't do this but did find 18 exceptional and 12 net 
so exceptional learners. Your research'question is whether good vs. poor learners 
show differences in auditory short-term memory. After testing each person, you 
use their scores to rank-order ail the learners and then perform a Rank sums test 
which gives you a z statistic of t 3.8. Good language learners did significantly 
better than the poor language learners on the test of auditory short-term memory 
Now, you wonder how much of the variability in the short-term memory scores; 
from al. the 5s can be associated with being a "good" vs. "poor " learner. 


(Notice, this time, that N [the total number of Ss] rather than df is used.) The
strength of association found in this case is η² = .498. Thus, you can say that
49.8% of the variance in the ranks of auditory short-term memory ability may
be attributed to learner classification type. Again, one independent variable has
accounted for much of the variability in short-term memory (STM). Yet, there
is still variance left to be explained. At this point, that variance is all "error": that
is, attributable to something other than the independent variable you have chosen
to study.
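
The arithmetic behind this figure can be checked in a few lines. What follows is a minimal Python sketch rather than part of the procedure itself; the function name `eta_squared_from_z` is an illustrative assumption, and the inputs are the z value and group sizes given in the example above.

```python
def eta_squared_from_z(z, n_total):
    """Strength of association for a test that yields a z score: z^2 / (N - 1)."""
    if n_total < 2:
        raise ValueError("N must be at least 2")
    return z ** 2 / (n_total - 1)


# Worked example from the text: z = 3.8 with 18 + 12 = 30 Ss.
eta_sq = eta_squared_from_z(3.8, 18 + 12)
print(round(eta_sq, 3))  # 0.498
```

Note that the denominator uses N, the total number of Ss, and not the df.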

This strength test is for the association in your sample (i.e., not in the population).
It should make you begin to wonder just what the components of such
classifications as "good learners" and "poor learners" might be. Certainly the outcome
makes it look like a good short-term auditory memory might be one of these
components. Of course, we are reasoning in a backward direction this time. But
by looking at the strength of association, we should begin to form our ideas about
how to improve the research, what other things we might want to enter into the
equation in our future research on this question.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 9.6 

Do a strength of association test on the Rank sums problems in this chapter.
How strong an association is there for the levels of the variable? Report the
results below.


► 1. Composition study (page 275)
► 2. Philippines study (page 280)


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO


Deciding Among Procedures 

The Wilcoxon Rank sums test was used by Hensley and Hatch (1986) in a
preliminary analysis of data on the language lab listening comprehension program
in a refugee camp on Bataan, the Philippines. Scores on a set of ratings (ratings
of pronunciation, syntax, confidence, communication ability, etc.) were given to
individual Ss (based on their performance during an informal conversation
exchange). These ratings were summed. We could not assume that the data, most
of which were 5-point scales, showed interval continuity. The data at this
preliminary point in the analysis were from 29 Ss in the experimental (language lab)
group and 32 Ss in the control group. It was probably not safe to assume the
data were normally distributed. The data were run on the computer using
procedures for comparing two groups in the SAS statistical computing package. The
printout reported the means of the two groups as 37.69 for the experimental
group and 24.94 for the control group, and it gave us a t-test approximation of
probability at .006. Then it showed us the z value as 2.8009 and the probability
as .005 on the Wilcoxon Rank sums test.

Though less of the available information in the data was used by the Wilcoxon,
the result of the Rank sums test was not very different from that of the t-test.
Therefore, we had confidence in the difference found between the two groups in
the preliminary data set.

While running the Rank sums test on SAS, it was also possible to request another
analysis, the Median test. You'll remember that when we have extreme scores,
the median is the more appropriate estimate of central tendency. Since we had
not yet plotted the data, it seemed wise to run this analysis in case we had
"outliers," Ss with extreme scores in either group.

The Median test, you'll recall, counts the number of scores above vs. below the
median for each group. If the two groups were the same, there should be a
similar number of people above and below the median in each group. The
frequencies (number above and below the median) for each group are then compared
with these expected frequencies, and a z value is obtained. In this case the z value
was 1.684 and the probability was .0922. Should this lessen our confidence?
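
The probabilities attached to these z values can be reproduced from the standard normal curve. The sketch below is illustrative only: it assumes the printout reports two-tailed probabilities from a normal approximation, and the function name `two_tailed_p` is hypothetical.

```python
from math import erfc, sqrt


def two_tailed_p(z):
    """Two-tailed probability of a z score under the standard normal curve."""
    return erfc(abs(z) / sqrt(2))


# z values reported on the printout described in the text.
for label, z in [("Rank sums", 2.8009), ("Median", 1.684)]:
    print(label, round(two_tailed_p(z), 4))
```

Under that assumption the two results should come out close to the .005 and .0922 reported above, which places the Rank sums result below the .05 level and the Median test result above it.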

How can we decide which of these three tests (the Wilcoxon Rank sums, the t-test
approximation, or the Median test) gives us the most accurate information? If
we select the Median test, we cannot say the two groups differed. Both the Rank
sums test and the t-test say the groups did differ.


Chapter 9. Comparing Two Groups: Between-Groups Designs 281 


Think for a moment of our earlier discussion of power. The most powerful test
is the test which uses the most information and is the least likely to lead us into
error in claiming a difference exists when none does or in claiming that no
difference exists when, in fact, there is a difference in the groups. The Median test
uses only information related to position above or below the median. The Rank
sums procedure uses not only a score's relation to the median but its relative
distance from the median as well. The t-test does the same, except that it uses
the $\bar{X}$ and a "ruler" for distance from the $\bar{X}$.

The most powerful test, then, is the t-test, but we argued that the data did not
meet the assumptions of the t-test. The second most powerful test was the Rank
sums test. The confidence it gave us in our claims that the two groups differed
was as great as that from the t-test approximation. Since we had a fairly
substantial number of Ss at this point, the power of the two tests was approximately
equal. We therefore used the Rank sums test to give us confidence in our
interpretation of the findings.

To review, we have presented two nonparametric "equivalents" to the t-test.

1. The Median test has the least power since it throws away much
information. It tallies the number of Ss (or observations) above
and below the median for each group, compares this with the
expected frequencies for each group, and calculates a T score
which is evaluated like a z score. (It does not measure distance
from the median.)

2. The Rank sums test rank-orders Ss or observations and then
locates those from each group in the ranking. Both groups
should be equally distributed in the ranking if there is no
difference between the two groups. It uses information about the
actual rank of each S or observation in drawing the comparison
between groups. With large sample sizes, it approaches the
power of the t-test.

3. Both tests compare the values of two groups, and check the
difference between the groups for statistical significance. The
parallel parametric test is the t-test.

4. There are other nonparametric tests that compare two groups.
You might, for example, encounter the Kolmogorov-Smirnov
test in reports. This test counts the number of Ss (or responses)
at each point on a scale (for example, a 5-point scale), and uses
the information on the cumulative differences between two
groups in the analysis. Thus, it uses more information than the
Median test. However, it requires an n of 40 in each group.

For a full range of nonparametric tests equivalent to the t-test,
please consult statistical computing packages (SAS, SPSS-X) or
Siegel (1956).
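
The rank-ordering step described in item 2 can be sketched in code. This is a minimal Python illustration, not the SAS procedure mentioned above. It rests on several assumptions that should be checked against your own formula sheet: the lowest score receives rank 1 (ranking from the top instead only flips the sign of z), tied scores share the average of their ranks, the common large-sample normal approximation is used, and no correction for ties is applied. The function names and the six scores are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from lowest (1) to highest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sums_z(group1, group2):
    """Normal-approximation z for the sum of ranks in group1."""
    n1, n2 = len(group1), len(group2)
    n_total = n1 + n2
    ranks = average_ranks(list(group1) + list(group2))
    rank_sum1 = sum(ranks[:n1])                      # sum of ranks for group 1
    expected = n1 * (n_total + 1) / 2                # expected sum if no difference
    std_error = sqrt(n1 * n2 * (n_total + 1) / 12)   # no tie correction
    return (rank_sum1 - expected) / std_error


# Fictitious scores, for illustration only.
z = rank_sums_z([1, 2, 3], [4, 5, 6])
print(round(z, 3))  # -1.964
```

With groups this small the normal approximation is not appropriate; the example only shows the mechanics of ranking and summing.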

In this chapter we have discussed methods for testing the differences between two
groups. While you may never need to calculate any of these procedures by hand,
it is important to know how they test differences and the assumptions behind the
procedures. This understanding will allow you to select the most appropriate
procedure for your own research. It will also help you to evaluate the research
reports you read. In the next chapter we will discuss statistical procedures that
allow us to make these comparisons for repeated-measures designs.

Activities 

Read each of the following summaries. Decide whether the requirements of a
t-test procedure were met. Decide whether a nonparametric test would be
appropriate. Determine which statistical procedure you would use for each study.
If you feel that none of the procedures we have described so far is appropriate,
explain why.

1. J. D. Brown (1988. Components of engineering-English reading ability.
SYSTEM, 16, 2, 193-200) describes the evolution of a reading test on engineering
English. Among the research questions asked about the test were (a) which types
of items (engineering content, linguistic content) best distinguish the ability to
read engineering English, and (b) to what extent could ability to read engineering
English be accounted for by general English proficiency. One of the first issues
to be answered, however, was whether there was any difference between EFL
students who were or were not engineers in their ability to read
engineering-English texts.

Fifty-eight Chinese students of English participated in the study: 29 were
engineers studying at Guangzhou English Language Center and 29 were TEFL
trainees studying in the Foreign Language Department of Zhongshan University.

Here are the descriptive statistics comparing the reading test scores of TEFL and
engineering Ss:



          Engineers    TEFL
Mean      36.97        27.38
range     17-54        17-43
s.d.      8.22         6.63
SEM       3.39         3.57
n         29           29


The observed value of t was 4.89. The research hypothesis was two-tailed.
Interpret the table and state the conclusions that can be drawn. Then answer the
questions posed in the activity instructions.

2. M. A. Snow & D. Brinton (1988. Content-based language instruction:
investigating the effectiveness of the adjunct model. TESOL Quarterly, 22, 4, 553-574)
report on a variety of measures evaluating the adjunct model (where English
instruction is tailored to an academic content course). One question asked was
whether students enrolled in such classes were better prepared to meet the
academic demands of university content classes. A special exam was given which
required Ss to both read and listen to content material (taking notes) and then
(a) answer a series of true/false and multiple-choice questions on content, and (b)
write an essay which required synthesis of content information. The experimental
group were 12 students enrolled in an adjunct class and the control group
consisted of 15 students in a regular ESL class. Each group also took the ESLPE,
an English language exam. Here are the findings in table form:


Exam                     Exper.    Control    z       p
Objective test   M       25.4      26.1       0.63    n.s.
                 s.d.    5.4       7.8
Essay            M       66.5      67.2       0.42    n.s.
                 s.d.    13.3      16.3
ESLPE            M       90.8      99.4       2.11    .05
                 s.d.    11.0      13.5




The Wilcoxon Rank sums test was used to test the comparisons. Interpret the
above table and state your conclusions. Then answer the questions given in the
directions to this activity section.


3. E. Ford (1984. The influence of speech variety on teachers' evaluation of
students with comparable academic ability. TESOL Quarterly, 18, 1, 25-40.)
presented 40 teachers with speech and writing samples of children and asked
them to evaluate each child. Prior to the study, the written samples had been
rated as equal. Half of the oral English samples exhibited features of Spanish
phonology; the other half did not. After reading and listening to each sample, the
40 teachers assigned it a rating on a set of semantic differential scales related to
potential for academic success. The results showed lower semantic differential
ratings for the samples in the Spanish-accent group. The conclusion was that
teachers held lower expectations for students who spoke the Spanish-accented
English.


4. A. Ramirez & R. Milk (1986. Notions of grammaticality among teachers of
bilingual pupils. TESOL Quarterly, 20, 3, 495-513.) asked teachers attending
summer institutes to evaluate and rank four varieties of spoken English and
spoken Spanish (standard English/Spanish, hispanicized English/Spanish with
phonological and morphological deviations from standard, ungrammatical
English/Spanish, and English/Spanish code switching) for classroom
appropriateness, grammatical correctness, and speaker's academic potential. Evaluations
of both the English and Spanish language samples appear to have been made on
a standard language continuum, with code-switching the least acceptable and
correct.

A table presented in the study states that "All t-tests for pair comparisons are
significant at the .05 level with the exception of those indicated with a bracket."
Four varieties of Spanish and four of English are then compared for
appropriateness, for correctness, for speaker's academic potential, and for a total global
reaction (the sum of the means of the other three types of ratings).

5. F. Mangubhai (1986. The literate bias of classrooms: helping the ESL learner.
TESL Canada Journal, Special Issue 1, 43-54.) conducted a national survey in
Fiji that showed sixth-grade students were unable to read the simple English
prose required for successful schooling. Students in high-level achievement
groups invariably came from schools with large libraries. This study evaluated a
project called Book Flood where selected schools in rural areas of Fiji received
250 books. Programs where fourth- and fifth-grade students read the books
silently and programs with a "shared-book" treatment (teacher introduces the book
and reads all or portions of the book aloud to children as pictures, or a
"blown-up" version of the text is projected and then students are encouraged to
join in the reading and discuss the book together) were compared to control
schools which did not participate in Book Flood. The n sizes for the grade 4
classes were: shared-book = 71, silent reading = 84, and control = 65. For
grade 5, the n sizes were: shared-book = 89, silent reading = 86, and control =
87.

After eight months, all the classes were tested. The Book Flood schools were
compared with the controls using t-tests. The measures for the fourth grade were
reading comprehension tests, English structures, word recognition, and oral
sentence repetition. The fifth-grade students were tested on reading comprehension,
listening comprehension, English structures, and composition. t-tests were again
used to compare the Book Flood schools with the controls. The program was
continued into the next year, following the same Ss. The gains that the Book
Flood groups made in comparison to the control group continued into the second
year. No significant differences were found between the shared-book group and
the silent reading group.

6. D. Cross (1988. Selection, setting and streaming in language teaching.
SYSTEM, 16, 1, 13-22) presents a series of t-test comparisons of "upper set" and
"lower set" groups of students. (These terms relate to ability level and were made
prior to the research when children first entered the program.) The two groups
were compared on various measures such as reading, cloze, fluency, listening to
numbers, attitudes, etc. The null hypothesis could be rejected in each case, with
the upper set outperforming the lower set. By checking through the scores of
individuals within each group, it was possible to locate students who had been
misplaced and reassign them to a more appropriate group. In addition, Cross
checked to see if students in both the "upper set" and in the "lower set" improved
in language skills over time. Tables show gain or loss for each group as mean
score gain or loss (from pretest to posttest). The tables also give a percent change
figure for these gains or losses. No statistical procedures were used to analyze the
data in these particular tables. The author concludes that both groups did show
improvement.

Chapter 10

Comparing Two Groups: 
Repeated-Measures 


• Parametric comparisons: Matched t-test

Strength of association: eta²

• Nonparametric comparisons: repeated-measures

Sign test

Wilcoxon Matched-pairs signed-ranks test
Strength of association: eta²

• Deciding among procedures

• Advantages/disadvantages of nonparametric procedures


Parametric Comparisons: Matched t-test 

In the Case 1 and Case 2 t-tests that we have discussed so far, the means have
always come from two different groups. In Case 1 studies, the group mean ($\bar{X}$)
is compared with a population mean ($\mu$). In Case 2 studies, two group means are
compared. These are all between-groups designs.

Repeated-measures designs, where the comparison is within one group, are quite
common too. In such designs, the means are from the same group of Ss. For
example, we may wish to compare the performance of a group of Ss prior to
instruction and after the instruction. The scores are from the same Ss at two
different times. It's also possible that we might want to compare the performance
of a group of Ss on two different measures at one period in time. Again, the
design is a repeated-measures design.

It is also possible to compare paired data of another sort. For example, you
might have a pool of Ss who have all taken a general proficiency examination.
From that pool you might select 30 pairs of Ss who have the same score on the
examination. One S from each pair is randomly assigned to group 1 and the
other to group 2. The pairs of Ss are matched on the basis of language
proficiency. One group becomes the experimental group and the other the control
group. The performance of the two groups can then be compared. (Matching is
not a substitute for random assignment, but can supplement it.)

When we compare the performance of the same Ss, or of matched Ss, we must
change the t-test formula slightly. This is because we expect that the performance
of the same person (or of two Ss who have been matched) on two measures will
be closer than the performance of two different people on the two measures. That
is, the scores for the paired means are not independent. Therefore, the formula
must be changed slightly to take this into account.


The basic difference in the formula is that our N, now, will be for number of pairs
rather than number of Ss or observations. The standard error of difference
between means will be computed by dividing by the number of pairs minus one (df
for pairs) rather than number of observations minus 1. So, df = n_pairs - 1.

Imagine that the data collected by Lazaraton on a listening comprehension test 
prior to a special instruction unit and after the unit looked like this: 



Scores on listening comprehension

S      Pretest     Posttest    D          D²
1      21          33          12         144
2      17          17          0          0
3      22          30          8          64
4      13          23          10         100
5      33          36          3          9
6      20          25          5          25
7      19          21          2          4
8      14          19          5          25
9      20          19          -1         1
10     31          35          4          16

       ΣX = 210    ΣY = 258    ΣD = 48    ΣD² = 388
       X̄₁ = 21     X̄₂ = 25.8



The research question asks whether Ss improved following instruction. Is the
jump of 4.8 points in the mean a significant change? (The 4.8 is obtained by
subtracting $\bar{X}_1$ from $\bar{X}_2$ or by dividing $\sum D$ by N.) As you can see, the first step
is to see what difference there is between each pair of scores. The answer to this
is at the bottom of the column labeled D (difference): $\sum D$. To guard against
negative values, each difference value is squared and placed in the column labeled
D². The column is added, and the sum ($\sum D^2$) is placed at the bottom of the
column. We can now plug the values into the paired t-test formula.

$$t_{obs} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{D}}}$$


As always, the top half of the formula is the easy part. It gives us the difference
between the two means. And, as always, the bottom half of the formula is a type
of "ruler" that measures off the distance in terms of the standard error of
differences between two means. The standard error of differences between the two
means, however, is adjusted to account for the fact that the means are from the
same S (or from matched pairs). To remind you that we are using matched pairs,
we use the symbol $s_{\bar{D}}$.

The formula for $s_{\bar{D}}$ is as follows:

$$s_{\bar{D}} = \frac{s_D}{\sqrt{n}}$$

To find $s_D$ use the following formula:

$$s_D = \sqrt{\frac{\sum D^2 - \tfrac{1}{n}\left(\sum D\right)^2}{n - 1}}$$


In the matched t formula, the n is the number of pairs (not the n for scores). 
Thus, the standard deviation of the differences is adjusted for the number of 
pairs. 

If this is clear, let's put the values into the formula:

$$s_D = \sqrt{\frac{388 - \tfrac{1}{10}\left(48^2\right)}{10 - 1}} = \sqrt{\frac{157.6}{9}} = 4.18$$

Now we can calculate $s_{\bar{D}}$, the standard error of differences between the two
means, in order to obtain our "ruler."

$$s_{\bar{D}} = \frac{4.18}{\sqrt{10}} = 1.323$$


Now that we have our "ruler," we can check the difference between the two
means to find our observed t value. The formula for this should be familiar.

$$t_{obs} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{D}}} = \frac{21 - 25.8}{1.323} = -3.63$$

As you will have guessed, our next step is to check this observed t value against
the t critical value in the t distribution table (table 2 in appendix C). We use the
same t-test table as before (the changes in the formula have adjusted the observed
t value to compensate for the paired nature of the data).

There is a small difference in terms of degrees of freedom. In the regular t-test
we calculated the df as n₁ - 1 + n₂ - 1. In the Matched t-test, we work with
pairs. The df is the number of pairs - 1. Since there are 10 pairs in this study,
df = 9. We need to check the value to see whether the difference is significant
at the .05 level. The critical value for t is 2.262. We can reject the null hypothesis
because our value of 3.63 (ignoring the sign) exceeds 2.262. We can have
confidence in concluding that treatment does have an effect on performance in these
data. Student scores differ significantly from pretest to posttest. Again, the
information may be summarized in a simple statement: t = 3.63, df = 9, p < .05.
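
The whole calculation can be retraced in code. The following is a minimal Python sketch under the formulas given in this section; the function name `matched_t` is an illustrative assumption, and the scores are the hypothetical pretest and posttest data from the table above.

```python
from math import sqrt


def matched_t(first, second):
    """Matched t-test: returns (t_obs, df) for paired scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(first)                                   # number of pairs
    diffs = [b - a for a, b in zip(first, second)]   # D for each pair
    sum_d = sum(diffs)
    sum_d2 = sum(d * d for d in diffs)
    s_d = sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))  # s.d. of the differences
    s_dbar = s_d / sqrt(n)                           # standard error, the "ruler"
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    return (mean_first - mean_second) / s_dbar, n - 1


pretest = [21, 17, 22, 13, 33, 20, 19, 14, 20, 31]
posttest = [33, 17, 30, 23, 36, 25, 21, 19, 19, 35]
t_obs, df = matched_t(pretest, posttest)
print(round(t_obs, 2), df)  # -3.63 9
```

The sign of t depends only on which mean is subtracted from which; it is the absolute value that is compared with the critical value in the table.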

On the basis of the findings, we can reject the null hypothesis of no difference
between pretest and posttest. The (hypothetical) class did improve. The t-test
gives us confidence that the difference is real in these data. That is, the test is
being used for descriptive purposes only. Why can't we use the test for inferential
purposes? The reason is that there are many threats to the external validity of
the study. The students are not randomly selected or randomly assigned. It is
an intact class. So, even though the t-test allows us to generalize, the design of
the study precludes such generalization.

For descriptive purposes, the results are also open to question. We have claimed
that treatment has an effect on scores. The students performed significantly
better on the posttest. However, the design has so many threats to internal
validity that we need to temper any claims. We cannot be sure that it was primarily
the treatment that caused the change. Improvement might be due to many
different things (including, of course, the instruction).

In this example, we used two scores from the same group of Ss, a
repeated-measures comparison of means. Paired t-test data could also be obtained from
pairs of Ss who have been carefully matched (so that we can assume that they
will perform similarly, all things being equal). In this example they might be
matched according to the pretest results, on the basis of a hearing test, by first
language, by sex, or other relevant characteristics. The calculations are the same
whether the comparison is for the same group or matched pairs. In either case
we expect the means ($\bar{X}$s) to be closer than they would be if two groups of
randomly selected Ss were compared. The Matched t-test is designed to deal with
this assumption. What would happen if you used a regular t-test for a
repeated-measures design? Think about it for a moment. Differences that might turn out
to be statistically significant on a matched-pairs test might only be a "trend" (and
statistically not significant) with a regular t-test. This is because we expect
smaller differences in the performance of the same S on two measures. The
Matched t-test has been designed with this fact in mind.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO


Practice 10.1

1. The data we worked with are hypothetical data. Here is the actual table used
by Lazaraton to report her results:

Matched t-test on Gains after Treatment

Group       n     Mean    s.d.    t value    df    p
Pretest     38    17.3    5.1     2.81       37    .008
Posttest    38    19.4    4.9

Give your interpretation of the table:


► 2. Imagine that you have been given reading test scores of two carefully
matched groups of university students. These students are studying electrical
engineering at the University of Guadalajara and, as part of that training, have
a course in technical English. Ss have been matched for English proficiency and
reading subtest scores and one member of each pair is assigned to group 1 and
the other to group 2. Group 1, by a toss of a coin, becomes the experimental
group and group 2 the control. The experimental group receives an intensive
reading program that stresses top-down reading strategies (using publications and
papers from chemical engineering). The control group receives a four-skills
technical English course. At the end of the course, all students are tested again and
gain scores on the reading subtest are computed for each student. As you might
imagine, not all students appear for the posttest but 10 matched pairs are still
there. Their gain scores must now be compared as matched groups. Here are the
(fictitious) data:

Gain Scores on Reading Subtest

Pairs    Experim.     Control
A        8            6
B        7            5
C        5            5
D        3            4
E        1            4
F        6            2
G        8            5
H        9            3
I        5            4
J        5            3

         ΣX = ___     ΣX = ___     ΣD = ___     ΣD² = ___
         n = 10       n = 10
         X̄ = 5.7      X̄ = 4.1

The means of the two groups are closer together than those in the previous
example, but you cannot conclude from this that the difference in the two groups
is or is not statistically significant. Other information must be taken into account.

Before you can tell whether the difference between the means of the two groups
is significant, you must use a "ruler" to measure off the placement of the
difference in the t distribution. Once again, you will find the standard error of
differences between two means and then adjust it for pairs.


$$s_D = \sqrt{\frac{\sum D^2 - \tfrac{1}{n}\left(\sum D\right)^2}{n - 1}}$$


Remember that n means the number of pairs, so with 10 pairs the denominator
is 9.


Complete the computation of $s_D$:

$$s_D = \sqrt{\frac{\_\_\_\_ - \tfrac{1}{10}\left(\_\_\_\_\right)^2}{10 - 1}}$$

$s_D$ = _____



To find the standard error of differences between pairs of means ($s_{\bar{D}}$), divide
$s_D$ by the square root of the number of pairs:

$$s_{\bar{D}} = \frac{s_D}{\sqrt{n}}$$

So, $s_{\bar{D}}$ = _____

Now compute the t value:

$$t_{obs} = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{D}}}$$

The t value = _____

How many df are there? _____

What is the critical value of t for .05? _____

Can the null hypothesis be rejected? Why (not)? _____

What can you conclude? _____

Does the design allow you to generalize the results? Why (not)? _____


3. Whenever we analyze interval or ordinal scale data, we always look first at the
mean (or other measure of central tendency) and the standard deviation (s). When
we want to compare two groups, our first reaction on seeing means which are very
similar is to assume that there is no significant difference between the two groups.
When we see fairly large differences between two means, we assume there is a true
difference between the groups. Explain why this may not turn out to be the case.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Strength of Association: eta²

While the Matched t-test allows us to test the difference in the two means of a
group, it does not tell us how "important" this difference is. We can check this
for any particular data set by applying a strength of association formula, eta².
The formula is exactly the same as that presented in the last chapter for the
between-groups t-test.

$$\eta^2 = \frac{t^2}{t^2 + df}$$

Let's apply this to the pretest-posttest data on page 288. The observed value of
t was -3.63 and there were 10 pairs of Ss (so the df is 9).

$$\eta^2 = \frac{3.63^2}{3.63^2 + 9} = .594$$


The Matched t-test allowed us to reject the null hypothesis and conclude that
students did improve. The strength of association between the pretest and
posttest is .594, showing that 59% of the variance in posttest scores can be
accounted for by the pretest scores. That leaves 41% still unaccounted for ("error"
as far as your research is concerned).
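
As a quick check on the arithmetic, the formula can be written as a short function. This is only a sketch in Python; the name `eta_squared_from_t` is an illustrative assumption.

```python
def eta_squared_from_t(t, df):
    """Strength of association for a t-test: t^2 / (t^2 + df)."""
    return t ** 2 / (t ** 2 + df)


# Worked example from the text: t = -3.63 with 10 pairs (df = 9).
print(round(eta_squared_from_t(-3.63, 9), 3))  # 0.594
```

Because t is squared, the sign of the observed t value makes no difference to the result.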


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 10.2 

► 1. In the Guadalajara research problem (practice 10.1.2), the observed value
of t was 1.99. It makes no sense in this case to do an eta² since the t value was
not statistically significant. Imagine, however, that t-observed had been 2.99.
Compute the strength of association and interpret your finding.

eta² = _____ Interpretation: _____


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Nonparametric Comparisons: Repeated-Measures 

We will turn now to other nonparametric tests that allow us to compare paired
data: that from the same Ss on two different measures or that of matched pairs.
These are the nonparametric equivalents to the Matched-pairs t-test.


Sign Test 

The goal of the Sign test is similar to that of a Matched-pairs t-test. That is, it
is used to discover whether the same Ss (or matched pairs of Ss) perform in
different ways over time (or on two tests).

Let's use the example we have cited before where a teacher wondered whether
information from conversational analysis could be used to teach these cultural
conventions in an ESL class. The study involved the conversational openings,
closings, and turn-taking signals used in phone conversations. The teacher asked
all students to call her during the first week of class. Each person's phone skills
(the openings, preclosings, closings, and turn-taking) were scored on a 1-10 point
scale. During the class a special phone conversation unit the teacher had
developed was presented and practiced. At the end of the course, the teacher again
had students call her and again ranked them on a 10-point scale.

Notice there is no control group. The data are not independent since the same
Ss take part in the pretest and the posttest. It is also possible that the ratings
given one S may influence those given the next S. (If raters were trained in using
a scoring checksheet, this bias could be reduced.) Look at the data in the
following table. There is evidence that suggests the data are not normally
distributed. Two Ss got the maximum score on the posttest, indicating a skewed
distribution.

S      Pretest    Posttest    Change
1      2          4           +
2      4          5           +
3      5          4           -
4      3          3           0
5      1          10          +
6      0          10          +
7      9          0           -


Look at the column marked "Change." This should give you a clue as to the
difference between the Sign test and the Matched t-test. The Sign test asks
whether or not Ss improved (the +s), regressed (the -s), or whether they showed
no change (the 0s). The test does not measure degree of change but rather the
presence of change.

The procedure is simple. Tally the + and - symbols under "Change." Discard
Ss who do not show change (cf. S4). Let's say there were actually 29 students in
the class. Twenty-one Ss improved (21 + changes) while 6 Ss did worse (6 -
changes). The total number of changes would then equal 27.

The next step is to assign R to the smaller total of changes. In the above example,
6 Ss form one group and 21 Ss form the other. Since the group of 6 is smaller,
R = 6.

Next, we turn to the Sign test table (table 3 in appendix C) to discover whether
we can reject the null hypothesis. In previous tables, the first column gave us the
df for the study. In this table, however, the first column (labeled N) shows that
the Sign test distribution is built around the number of changes (+ or - changes)
that take place in the data.

In our study the total number of changes was 27. To find the critical value for
α = .05 for an N of 27, locate the 27 row and check the .05 column. The R value
must be equal to or less than that given in the chart. The R critical value for an
N of 27 is 7. Our R value is 6, so we can reject the null hypothesis. Almost all
statistical tests require a value equal to or greater than the value listed in the
table. The Sign test (and the Wilcoxon Matched-pairs test) is unusual in this
regard. Remember that you need a number equal to or smaller than the value given
in the distribution chart.
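
The Sign test steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original procedure: the function name `sign_test` is made up for the example, and instead of looking R up in table 3 it computes the exact binomial probability that such a table is assumed to be built from.

```python
from math import comb


def sign_test(n_plus: int, n_minus: int) -> tuple[int, int, float]:
    """Return (N, R, two-tailed exact p) for a Sign test.

    Pairs showing no change are assumed to have been discarded already.
    """
    n = n_plus + n_minus          # N = total number of changes
    r = min(n_plus, n_minus)      # R = the smaller of the two tallies
    # Under the null hypothesis each change is + or - with probability .5,
    # so the chance of R or fewer changes in one direction is a binomial tail.
    one_tail = sum(comb(n, k) for k in range(r + 1)) / 2 ** n
    return n, r, min(1.0, 2 * one_tail)


n, r, p = sign_test(n_plus=21, n_minus=6)
print(f"N = {n}, R = {r}, two-tailed p = {p:.4f}")
```

For the 21 + and 6 - changes in the example, the probability comes out well below .05, which agrees with the decision reached from the table.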

The statistical test gives us confidence in reporting that a significant number of
Ss improved in their ratings on the posttest. We would not be wise to generalize
the findings, however. We have a very small n size and these few Ss are from an
intact class. We may not be sure about the reliability of our rating instrument.
Also, Ss might have improved anyway, regardless of the lessons presented.
However, the Sign test gives us statistical evidence that supports the description
of the data.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 10.3 

► 1. In the section on Matched t-tests (page 288), we asked you to interpret a
table reporting the comparison of Ss' listening comprehension scores from pretest
to posttest where they had received a special listening comprehension program.
A t value of 2.81 (df = 37) was obtained and the probability level was better than
.01. The test assessed the degree of change obtained by students. If Lazaraton
had not been pleased with the test or thought the data were not normally
distributed, she could have reported the results of a Sign test instead. This would
not look at degree of change but rather the presence of change. Here are the
changes:

Sign test on Gains After Treatment 

+23 

-13 

The N for total changes = ____. The R obs is ____. For α = .05, R crit = 11.

Interpret this finding. ____


2. If you believed the test was highly reliable and valid, that the scores were
interval measurements, and that the data were normally distributed, in which
report would you have most confidence? Why? ____


If you felt the scores only crudely represented interval data, which report would
you feel was most accurate? Why?

In either case, why was Lazaraton not able to generalize her findings to claim
that the materials would be equally effective for all ESL students? Even though
we cannot generalize, we use statistical tests to analyze the data. Why?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

This example shows us how much information can be lost when turning from a
Matched t-test to a Sign test. Large positive changes may occur for some
students and no change or small negative changes for others. The Matched t-test
measures all these degrees of change in its analysis. The Sign test throws away
all the information on amount of change, keeping only the notion of + or - change.
Fortunately, there is a test that is a more effective nonparametric test of matched
pairs than the Sign test. We will turn to that next.


Wilcoxon Matched-Pairs Signed-Ranks Test

The Sign test uses information about the existence and direction of change in
paired data. If we want to consider the degree of change as well as the direction
of the differences, a more powerful nonparametric procedure can be used. The
Wilcoxon Matched-pairs signed-ranks test does this by giving more weight to a
pair which shows a large difference between the two groups than to a pair which
shows a small difference.

The Wilcoxon Matched-pairs test assumes that the researcher can tell which
member of a pair is greater (or better) than the other. It also assumes that the
differences between pairs can be rank-ordered for absolute size. That is, the
researcher can make a judgment of "greater" for one of any pair's two performances
and can also scale pairs as to degree of difference. This ability to give
information not only about the difference within each pair but also the differences
between pairs is sometimes called measuring an ordered metric scale. In strength,
an ordered metric scale falls somewhere between an ordinal scale and an interval
scale.

As an example, let's imagine that you wanted to explore the effectiveness of a
listening comprehension program given in a language lab setting on overall
communicative competence. To carry out the study you found eight pairs of
students that you could match in terms of their L1, sex, scores on a language
proficiency exam, and a listening comprehension test. In the interest of good
design (not as a requirement of the statistical procedure), one member of each pair
was randomly assigned to the experimental language lab program and the other
member to a regular language lab program. After the treatment all Ss were given
a test of communicative competence which has five 5-point scales (possible total
= 25). Here is our design table yet again:

Research hypothesis? There is no difference in communicative
    competence between the two groups.
Significance level? .05
1- or 2-tailed? 2-tailed

Design
    Dependent variable(s)? Communicative competence
    Measurement? Scores based on five scales
    Independent variable(s)? Group
    Measurement? Nominal (lab vs. control)
    Independent or repeated-measures? Repeated-measures (two
        paired groups)
    Other features? Random assignment to groups; tests transfer of
        skills

Statistical procedure? Wilcoxon Matched-pairs signed-ranks


Box diagram:

    EXPER | CONTROL


Here are the data:


Pair    Exp    Cont     d     Rank

a       25     14      11      +7
b       24     12      12      +8
c       22     23      -1      -1
d       18     12       6      +4
e       17     10       7      +5
f       20     10      10      +6
g       18     22      -4      -3
h       16     13       3      +2

                            T = 4

To carry out a Wilcoxon Matched-pairs test, the first step is to compute the
difference between each pair of observations. Pairs which have the same values
(where the difference is 0) are dropped from the analysis and the n size reduced
accordingly. Each pair, then, has a d score. The ds are next ranked without
regard to sign, with the rank of 1 given to the smallest d. A d of -3 is ranked
the same as a d of +3. If there are ties for a rank, the rank is averaged as it was
for the Rank sums test.

Once ranked, each rank is assigned a sign of + or - to indicate whether it is from
a positive or negative difference. If the null hypothesis is true, some of the high
ranks will have +s and some will have -s, so that when we add up the positive
ranks and the negative ranks, the two sums would be nearly equal. To
reject the null hypothesis the sum of ranks either for the +s or for the -s must
be much higher.

As with the Rank sums test, the Wilcoxon has us add up the ranks of the +s and
the -s. The smaller of these becomes T; that is, T is the sum of the ranks for
whichever sign gives the smaller sum. Our next problem is to determine the level
of significance of T. This determination depends on whether the study used only a
few pairs or used more than 25 pairs. If the sample size is small, as it is in our
example study, the T can be checked against table 4 in appendix C.

Our T value = 4, and there are 8 matched pairs. Check the table to be sure we
can reject the null hypothesis. If the observed T value is equal to or less than the
critical value given, we can reject the null hypothesis. The N = 8 and the table
says that we can reject the null hypothesis at the .05 level because T critical =
4, which is equal to our T observed. In reading the table, remember that (as with
the Sign test) your obtained value must be equal to or less than the value given
in the chart. It's easy to misread the chart, but you will notice that the numbers
get smaller, not larger, as you move from .05 to .01 on the chart. That should
help to remind you that the value needs to be small enough (not large enough) to
allow you to reject the null hypothesis.
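
The ranking steps just described can be traced with a short Python sketch. It is offered only as an illustration of the hand procedure: the function name `wilcoxon_t` and the list names are invented for the example, and the scores are the eight Exp/Cont pairs (a to h) from the data table.

```python
# Scores for the eight matched pairs (a-h) from the example table.
exp = [25, 24, 22, 18, 17, 20, 18, 16]
cont = [14, 12, 23, 12, 10, 10, 22, 13]


def wilcoxon_t(x, y):
    """Return (T, N, signed_ranks) for the Wilcoxon Matched-pairs test."""
    # Step 1: compute the differences; pairs with d = 0 are dropped.
    d = [a - b for a, b in zip(x, y) if a != b]
    # Step 2: rank |d| from smallest (rank 1), averaging ranks for ties.
    ordered = sorted(abs(v) for v in d)

    def avg_rank(v):
        first = ordered.index(v) + 1          # first 1-based position of v
        last = first + ordered.count(v) - 1   # last 1-based position of v
        return (first + last) / 2

    # Step 3: give each rank the sign of its difference.
    signed = [avg_rank(abs(v)) * (1 if v > 0 else -1) for v in d]
    pos = sum(r for r in signed if r > 0)
    neg = -sum(r for r in signed if r < 0)
    # Step 4: T is the smaller of the two rank sums.
    return min(pos, neg), len(d), signed


t_obs, n_pairs, signed_ranks = wilcoxon_t(exp, cont)
print(signed_ranks)
print(f"T = {t_obs}, N = {n_pairs}")
```

The signed ranks it prints match the Rank column of the table, and the two negative ranks (1 and 3) sum to the T of 4 used above.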

You could, of course, use the Sign test on these data. Before you do so, however,
think about the information you would lose if you chose a Sign test instead.
Notice that the two minus ds are among the smallest differences found. The Sign
test is not affected by the degree of difference between the pairs. It is less likely
that you will be able to reject the null hypothesis if you lose this information. (If
you are not sure about this, try the Sign test on the data.) The Sign test, because
it is less powerful, gives you more chances of being wrong, either in rejecting the
null hypothesis when you should not or in accepting it when you should not.

When the sample size (the N for number of pairs) is greater than 25, we cannot
use the T table in the appendix to check the significance of the T value. When 
the number of pairs is greater than 25, the z formula is used: 


$$z = \frac{T - \dfrac{N(N+1)}{4}}{\sqrt{\dfrac{N(N+1)(2N+1)}{24}}}$$

The numbers 4 and 24 are constants in the formula; they do not come from the
data. Actually, this formula will work even with small sample sizes. Let's plug
in the values from the above example and see what happens.


$$z = \frac{4 - \dfrac{(8)(9)}{4}}{\sqrt{\dfrac{(8)(9)(17)}{24}}} = \frac{4 - 18}{\sqrt{51}} = \frac{-14}{7.14} = -1.96$$

Checking this z value in the z score table in the Appendix shows that the
probability of a z value this extreme is .05 for a two-tailed test. This is the same
probability as we found using the T value distribution table.
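
As a rough check on the arithmetic, the z formula can be written out in Python. The function names `wilcoxon_z` and `two_tailed_p` are invented for this sketch, and the probability is taken from the normal curve via the standard library's `erf` rather than from the z table in the Appendix.

```python
from math import erf, sqrt


def wilcoxon_z(t: float, n: int) -> float:
    """z approximation for the Wilcoxon T (4 and 24 are constants)."""
    expected_t = n * (n + 1) / 4
    sd_t = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - expected_t) / sd_t


def two_tailed_p(z: float) -> float:
    """Two-tailed probability of a z at least this extreme."""
    return 1 - erf(abs(z) / sqrt(2))


z = wilcoxon_z(t=4, n=8)
print(f"z = {z:.2f}, two-tailed p = {two_tailed_p(z):.3f}")
```

With T = 4 and N = 8 this gives z = -1.96 and a two-tailed probability of about .05, in line with the value read from the table.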

Actually, the above example was part of an evaluation study of the language lab
program in the Philippines (Hensley & Hatch). Twenty-nine pairs of Ss
participated in the study. The z score was 2.05 (two-tailed probability = .04). The
students who participated in the listening comprehension program in the
language lab were rated higher than the control group on the communicative skills
following the special program.


Strength of Association: eta²

eta² may be used to show the strength of association for data analyzed using the
Wilcoxon Matched-pairs signed-ranks test. Since the Wilcoxon yields a z value,
the eta² formula is:

$$\eta^2 = \frac{z^2}{N - 1}$$

If we apply this formula to the fictitious data above, we find:

$$\eta^2 = \frac{(-1.96)^2}{8 - 1} = \frac{3.84}{7} = .549$$

There is a strong relation of the two variables in this particular data set (probably
more than would be found in real data). The overlap of the two variables is 55%,
leaving only 45% yet to be accounted for. With this association, we would be
very confident of our conclusion (that the two variables are, indeed, related in an
important way). We are free to consider which variables we want to research
next to reduce the remaining "error."
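
The eta² calculation is a single line of arithmetic; a minimal Python sketch (with the made-up helper name `eta_squared`) shows where the 55% overlap comes from.

```python
def eta_squared(z: float, n: int) -> float:
    """Strength of association for a test that yields z: z squared over N - 1."""
    return z ** 2 / (n - 1)


eta2 = eta_squared(z=-1.96, n=8)
print(f"eta squared = {eta2:.3f}")                      # overlap of the variables
print(f"left to account for = {1 - eta2:.3f}")
```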

There is no strength of association formula for the Sign test. 

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


Practice 10.4 

1. If you would like to compare all three procedures for matched pairs of data,
apply the Matched-pairs t-test and the Sign test to the data on page 298. How
do you account for the differences in the results of these three procedures?

2. To give you the opportunity to practice the Wilcoxon, try the following
problem. Imagine that you wish to replicate Weinstein's (1984) study which
looked at the use of reduced forms by speakers in formal and informal talk. As
part of the replication, you want to determine whether the rate speakers use
differs under the two conditions (formal/informal). Here are fictitious data for the
replication study (remember each S was observed twice):

Syllables per Second

S      Formal    Informal

S1      6.3        6.4
S2      3.6        3.4
S3      4.5        4.8
S4      5.7        5.3
S5      4.8        5.1
S6      6.6        5.8
S7      5.8        6.3
S8      5.4        6.8
S9      4.2        4.8
S10     3.1        4.4
S11     4.0        4.7
S12     5.5        6.8
S13     5.6        6.3
S14     4.7        5.8
S15     6.3        6.3


Give your calculations in the space below. Remember that when there is no
change in a pair, that pair is dropped from the analysis and the number of pairs
reduced accordingly.


Can you reject the null hypothesis of no difference in rate for the two conditions?


► 3. The Wilcoxon test gives us a z value, so the test of association is the same
as for the Median test:

$$\eta^2 = \frac{z^2}{N - 1}$$

Calculate and interpret the strength of association (eta²) for these data:

eta² = ____ Interpretation: ____

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO


Deciding Among Procedures 

In chapters 9 and 10, we have discussed a variety of tests which can be used to
determine whether the data from two groups differ. The tests used to compare
two groups of different Ss (between-groups comparisons) contrast with those that
compare the same Ss or matched Ss (paired comparisons). The statistical tests
used are also determined by the nature of the data. If the data are measured on
an interval or strongly continuous ordinal scale (rank-ordered) and if the data are
normally distributed (i.e., the X̄ is the best measure of central tendency), a
parametric test should be used. A regular or Matched-pairs t-test is the best
choice. These tests use the most information in the data and, therefore, are more
powerful tests.

When the data do not meet these requirements, a nonparametric test should be
selected. When the data are between-groups, then the Wilcoxon Rank sums or
the Median test should be used. For repeated-measures designs, use the
Wilcoxon Matched-pairs signed-ranks test. If the data have only the crudest of
ordinal measurement, the Sign test can be used.

There are other choices available. For example, you may see the Walsh test and
the Randomization test reported in some studies. (The Walsh test is sometimes
used for 15 or fewer pairs and the Randomization test is sometimes used as an
alternative to the Matched t-test where interval data are available.) If you are
interested in the range of tests available, please consult the SAS and SPSS-X
manuals or Siegel (1956).

The following chart may be helpful in selecting the most appropriate test for 
comparing groups. 


Msrmt               Between-groups             Repeated or Paired

Ordinal (Median)    Median                     Sign test
                    Wilcoxon Rank sums         Wilcoxon Match.-pairs
                      (aka Mann Whitney U)       signed-ranks
                    Kolmogorov-Smirnov         Walsh
                                               Randomization

Interval (Mean)     t-test                     Matched t-test
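
For readers who think in code, the chart can also be written as a small lookup table. This is a sketch under stated assumptions, not a decision procedure from the text: the dictionary `PROCEDURES`, the function `candidate_tests`, and the `normal` flag are all invented here, and the rule that interval data which are not normally distributed fall back to the ordinal row is one reading of the paragraphs above.

```python
# Hypothetical lookup mirroring the chart: (measurement, design) -> tests.
PROCEDURES = {
    ("ordinal", "between"): [
        "Median test",
        "Wilcoxon Rank sums (Mann Whitney U)",
        "Kolmogorov-Smirnov",
    ],
    ("ordinal", "paired"): [
        "Sign test",
        "Wilcoxon Matched-pairs signed-ranks",
        "Walsh",
        "Randomization",
    ],
    ("interval", "between"): ["t-test"],
    ("interval", "paired"): ["Matched t-test"],
}


def candidate_tests(measurement: str, design: str, normal: bool = True) -> list[str]:
    """List the procedures the chart offers for a two-group comparison."""
    # Assumption: the parametric row applies only to normally distributed
    # interval data; otherwise the nonparametric (ordinal) row is used.
    row = "interval" if (measurement == "interval" and normal) else "ordinal"
    return PROCEDURES[(row, design)]


print(candidate_tests("interval", "paired"))
print(candidate_tests("interval", "paired", normal=False))
```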

We will conclude this chapter by listing some of the advantages and
disadvantages inherent in selecting a nonparametric test over the parametric t-test
procedures.


Advantages/Disadvantages of Nonparametric Procedures

Advantages 

1. Probability statements of nonparametric tests are exact
probabilities regardless of the shape of the population distribution
from which the sample was drawn. The test does not depend
on an assumption of normal distribution in the population.

2. If sample sizes are small, a nonparametric test is best unless the
nature of the population distribution is known to be normal.

3. Nonparametric tests can treat data which inherently can be
ranked as well as data whose seemingly numerical scores have
the strength of ranks. That is, we may only be able to say that
an S is more X than another S without knowing exactly how
much more. Whenever a scale is "less" to "more" or "better" to
"worse," the numbers on the scale can be treated using
nonparametric tests.

4. Nonparametric tests (e.g., Chi-square and McNemar's tests in
chapter 14) can be used with frequency data (nominal
variables).

5. Nonparametric statistical tests typically are easier to calculate
by hand. Most statistical packages for computers include
nonparametric as well as parametric procedures.

Disadvantages 


1. Nonparametric tests are "wasteful" of data. Information is lost
when we change interval measurements to ordinal ranks or
nominal measurements. Parametric tests use more information
and thus are more powerful tests. (If you use approximately
10% more Ss or observations with a nonparametric procedure,
you can capture back some of this power.)

2. Less powerful tests are more likely to lead the researcher to err
in claiming differences where none exist or in claiming no
differences when they actually do exist (Type 1 and Type 2 errors,
respectively).

3. Tables for the probability of nonparametric tests do not appear 
in most statistics books so, if the analysis is done by hand, they 
may not be as easy to find. 


4. Not all computer packages include a full range of
nonparametric tests. You may have to search through several
to find the procedure you want. Most packages carry a full
range of parametric tests.


Activities 

Read each of the following summaries. Decide which might meet the
requirements of a Matched-pairs t-test procedure. Decide which might best be analyzed
using a nonparametric test. Determine which statistical procedure(s) you would
use for each study. If you feel that none of the procedures we have described so
far is appropriate, explain why.


1. M. Benson & E. Benson (1988. Trying out a new dictionary. TESOL
Quarterly, 22, 2, 340-345) asked whether the BBI Combinatory Dictionary of
English--a dictionary which provides information on grammatical and lexical
collocations--could be used to improve the ability of advanced learners to supply
appropriate lexical collocations. Thirty-six EFL teachers (19 Hungarian and 17
Soviet teachers of English participating in a summer program in the United
States) served as Ss. A collocation test was devised with 25 items such as:


Her lawyer wanted to ____ (enter) a plea of not guilty.
This drug is effective ____ (against) the common cold.


Pretest and posttest results are given in the following table. 


Group        Pretest    Posttest

Hungarian      43         92
Soviet         32         93


The authors say the data indicate remarkable improvement by the participants
in each group. Decide what statistical procedure you would use to test this
improvement. What statistics, aside from these figures, would you need in order to
carry out the procedure(s)?

2. R. Budd (1988. Measuring proficiency in using English syntax. SYSTEM, 16,
2, 171-185) applied K. Hunt's sentence-combining method to encourage
development of syntactic complexity in the writing of school children in the United
States. The author was especially interested to see what effect this method might
have when used with ESL students--in this case, Malaysian adolescents. Among
several tables is the following which shows the results for various indices of
syntactic proficiency for one such group of students:

Synopsis Scores for 1st-year Malaysians

Measure    Start of Year    End of Year

w/c             6.81            7.38
c/T             1.30            1.43
w/T             8.85           10.50
T/s             1.18            1.61
w/s            10.46           12.08

w/c = words per clause; c/T = clauses per T-unit; w/T = words per T-unit; T/s
= T-units per sentence; w/s = words per sentence.

These indices allow the author to state that after nine months (approximately 144
hours of instruction), the scores of these students showed them jumping from 8th
to 10th grade on w/c and w/T indices and from the 6-8 grade range to the 10-12
grade range on c/T and w/s indices. No procedure was used to test the
differences shown in the above table. Decide what procedure(s) might be appropriate
for the data. What further statistics would you need in order to be able to test
the differences?

3. D. L. August (1987. The effects of peer tutoring on the second language
acquisition of Mexican-American children in elementary school. TESOL Quarterly,
21, 4, 717-736) examined the effect of a special type of peer tutoring on
the second language development of two groups--children acquiring English
(experiment 1) and those acquiring Spanish (experiment 2). Peer tutoring, here,
featured the learner as the "knower" of information. The learners were shown
how to play a game or perform an enjoyable task such as baking cookies. They
then selected a fluent speaker of the target language to instruct in a one-to-one
interaction. The general research question was whether peer interaction would
better promote acquisition than small-group instruction. Prior to the experiment,
learners were matched on several criteria and then randomly assigned to either
experimental or control groups. There were six experimental and six control Ss
in experiment 1 (ESL) and seven experimental and seven control Ss in experiment
2 (SSL). Both experimental and control groups made gains over the instructional
period. In the ESL experiment, peer interaction promoted greater use of the
target language as measured by mean frequency of target language utterances to
peers and proportion of target language to peers before, during, and following
treatment. In the SSL experiment, peer interaction resulted in an almost total
absence of Spanish use. The author concluded that interaction was beneficial to
ESL students and discussed the difficulty of helping SSL children acquire
Spanish in an environment where English is the language with more status.


4. D. Fuller & R. Wilbur (1987. The effect of visual metaphor cueing on recall
of phonologically similar signs. Sign Language Studies, 54, 59-80) noted that
sign-naive adults recall newly learned manual signs better if they are grouped by
semantic similarity (e.g., apple, orange, banana) than by phonological similarity
(e.g., similar handshape, location, or movement of the sign). However, the
authors provided visual metaphor cues to phonologically similar signs by
demonstrating, for example, the pouring metaphor presented with an extended thumb
in a pouring action (signs: chemistry, drunk, pour, put gas in car). The scales of
justice metaphor was shown where hands move simultaneously and in vertically
opposite directions to give a visual illusion of balance (signs: balance, depend,
doubt, if, weigh, which). Nevertheless, learners still recalled signs presented in
semantic groups better than the signs presented in the two conditions of
phonological similarity (with and without metaphor cues). The visual metaphor
explanations did not significantly improve recall of phonologically similar signs.

5. Draw a chart of your own design which shows when each of the following tests
might be used: Rank sums, Wilcoxon Matched-pairs, Sign test, t-test, Median
test, and the Guttman test of scalability. (Try to include the notion of
nominal/ordinal/interval or noncontinuous/continuous in your chart.)


Chapter 11

Comparisons Between Three or
More Groups


• Parametric/nonparametric choice
• Parametric tests for comparing three or more groups
    Balanced and unbalanced designs
    Fixed effects and random effects designs
• One-way balanced ANOVA
    Assumptions underlying ANOVA
    Tests for locating differences among means
    Strength of association: omega²
• Nonparametric tests for comparing three or more groups
    Kruskal-Wallis test
    Nonparametric tests for locating differences among means


Parametric/Nonparametric Choice 

As was the case in comparing two groups, the first decision in selecting a
statistical procedure to compare three or more groups is that of a powerful parametric
test or of its nonparametric counterpart. Parametric tests, whether they compare
means of two groups (as does the t-test) or the means of three or more groups,
make the same assumption that the X̄ is the best measure of central tendency for
the data and that there is a normal distribution in the population. As we have
said many times, one way to promote normal distribution is through random
selection. A second method that might improve the chance of getting a normal
distribution is to use a large number of subjects and randomly assign them to
groups. When the basic assumption of normality cannot be met, it is always best
to select a procedure that does not use these measures to estimate parameters in
the population. On the other hand, when normality can be assumed, it would
be a serious mistake to use a nonparametric procedure because we want to use a
powerful test which uses as much information as possible in the data.

Parametric tests allow us to compare group means. These means are obtained
from continuous data where the spread of scores can be appropriately captured
by X̄ and s.d. or variance. A second assumption of such tests, then, is that of a
strong underlying continuity to the data. Ordinal scale data show such continuity
when the scale approaches interval measurement. When you are not sure about
the linearity of the scale itself or when the distribution of data on the scale does
not appear normal (i.e., the data tend to cluster around one or two points), a
nonparametric procedure is the better choice.

Nonparametric tests are, indeed, easier to do by hand. However, this should not
be a reason for selecting a nonparametric procedure!


Parametric Tests for Comparing Three or More Groups 

In chapter 10, we said that we cannot use the t-test procedure to make multiple
comparisons. That is, the t-test allows us to compare two means, not to
cross-compare group 1 with group 2, group 1 with group 3, and group 2 with group 3.
To do this, we will turn to an Analysis of Variance procedure, or ANOVA. In
the t-test we compared two X̄s and we also used the variability of the data in each
group. The measurement of variability was standard deviation. In
ANOVA, we will compare three or more X̄s and the measure of variability used
is variance. Hence the name Analysis of Variance. You should remember that
variance and standard deviation differ only in that the variance is the square of
the standard deviation (s²).

The ANOVA procedure is a powerful and versatile test, for it allows us to
compare several means simultaneously. The type of ANOVA used will depend on the
design of the research project. Imagine that you have administered a placement
examination to a large group of Ss from five different schools. You want to
know if the performance differs across the five schools. The design box for this
study would look like this:

S1 S2 S3 S4 S5


There is one dependent variable, performance on the examination. There is one 
independent variable with five levels. In the box diagram, the comparisons are 
made in only one direction-that of the independent variable. Therefore, the 
ANOVA used is called a "One-way" ANOVA. 

It's possible that you might want to compare student scores on the test not only
across five schools but by sex of the Ss. The amended design box would be:


S1 S2 S3 S4 S5

The comparisons are now drawn in two directions--for school and for sex--and
the interaction (between sex and school) is also shown. This is a two-way design.
A One-way ANOVA can no longer be used.

It's possible that as an administrator of an ESL program, you might want a
breakdown of scores by L1 membership (say four major language groups). The
comparison will be drawn in three ways:



As you can see, the designs that use ANOVA may be 1-way to n-way
comparisons. The more directions in which comparisons are drawn, however, the more
difficult the interpretation of results becomes. This is because it is very hard to
interpret higher-order interactions. (Also, as the design becomes more complex,
there are more cells to fill with observations, so you need more Ss.)

It is possible that the design will be a between-groups comparison. In this case,
the data from an S appear only once in the analysis. An S's score is never
compared with a second score from the same S. In some studies, repeated-measures
comparisons will be drawn. And in others, a mixed design will be used where
comparisons are drawn between groups and within groups as well. The type of
ANOVA to use changes according to the design.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 11.1 

► 1. Foreign students in a literature class were asked to rate short stories covered
during a course. Each story had a particular theme which might influence the
"appreciation" ratings. The data, then, consist of scaled ratings of four separate
themes. Research question: Are the ratings the same for each theme?
Between-groups, repeated-measures, or mixed design? ____

► 2. McMeans (1986) compared the CTBS (California Test of Basic Skills)
scores of a group of second-grade immersion students according to their reports
on when they first began reading in Spanish (preschool, kindergarten, or first
grade). Research question: Do Ss who report beginning reading in Spanish at
different times have different scores on the CTBS?

Between-groups, repeated-measures, or mixed design? ____


► 3. In chapter 6, page 182, we displayed a graph based on McGirt's study
comparing composition scores of ESL and regular English classes. The graph
showed the scores when the compositions were simply typed and presented to the
raters. In a second condition, the morphology and syntactic errors of all the
compositions were corrected, and the compositions were scored by raters.
Research question: Are the ratings given native speaker and ESL compositions the
same or different, and does this change if the morphology and syntax errors are
removed from both sets of compositions?

Between-groups, repeated-measures, or mixed design? ____


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO : 

Ha la need and l 'n balanced Designs 

Because ANOVA is a fairly lengthy statistical procedure, it is not often carried
out by hand. If you do plan to use an ANOVA procedure and expect to do it
by hand, it is important to try to get a "balanced" design.

Designs that are balanced (sometimes called orthogonal) have equal (or
proportional) n sizes for all the groups and subgroups to be compared. Unbalanced
designs (nonorthogonal) are those where some groups have more Ss or observations
than others. The statistics for a balanced design are much easier to carry out by
hand than those for an unbalanced design. For the computer this is no problem;
it can easily make all the necessary adjustments in computation. There is an
additional reason for using a balanced design, and this has to do with the
assumption of equal variances. When cell sizes are not equal, the assumption of equal
variance may be violated. (If you use a computer program, it will give you
information on the assumption of equal variance.) While ANOVA is known as
being a robust test (that is, we can violate the assumptions to some degree and
still have confidence in the outcome), it is still sensitive to violation of both
assumptions.

Let's first consider the notion of balance before we turn to the details of the
ANOVA procedure. Thirty students from each of five schools were given a
language aptitude test. If the comparison is to be drawn across the five schools, this
is a balanced design because there are 30 Ss in each of the 5 cells; the n is equal
in each cell. Of the 30 Ss, 15 were females and 15 males. The design is still
balanced at this point. There are 15 Ss in each of 10 cells. Five males and 5 females
at each school were Spanish speakers, 5 males and 5 females were Korean
speakers, and 5 males and 5 females were Vietnamese speakers. It is still a
balanced design; there are 5 Ss in each of 30 cells.
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if he box diagram for the study and the n size in each cell are shown below. Each 
fell has five observations, the minimal size possible for ANOVA. (Many statis¬ 
ticians would argue that the cell size is too low. That argument is based on two 
Issues. First, with such small n sizes, it is difficult to find an effect. Second, all 
&ou need is one person to throw everything off. For example, one Korean female 
lit school 3 might for some reason perform in a very atypical way and distort the 
Observations for all females. However, for simplicity of calculations and to dem¬ 
onstrate the notion of balance, we have used a minimal 5 5s per cell.) 


[Box diagram: L1 group (Spanish, Korean, Vietnamese) by sex (male, female) by
School (1 to 5); n = 5 in each of the 30 cells.]
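The notion of balance can also be checked mechanically by counting the
observations in each cell. The sketch below is illustrative only: the helper
names `cell_sizes` and `is_balanced` and the record layout (school, sex, L1)
are assumptions made for this example, not part of the study itself. It checks
for equal cell sizes only, not for proportional ones, and it cannot see cells
that have no observations at all.

```python
from collections import Counter
from itertools import product

def cell_sizes(records, factors):
    """Count observations in each cell defined by the given tuple positions."""
    return Counter(tuple(rec[i] for i in factors) for rec in records)

def is_balanced(records, factors):
    """True when every observed cell has the same n (equal-n balance only)."""
    sizes = set(cell_sizes(records, factors).values())
    return len(sizes) == 1

# Hypothetical records mirroring the example: 5 schools x 2 sexes x 3 L1 groups,
# with 5 Ss in every cell (150 Ss in all).
schools = [1, 2, 3, 4, 5]
sexes = ["F", "M"]
l1_groups = ["Spanish", "Korean", "Vietnamese"]
records = [
    (school, sex, l1)
    for school, sex, l1 in product(schools, sexes, l1_groups)
    for _ in range(5)
]

by_school = cell_sizes(records, factors=(0,))         # 5 cells
by_school_sex = cell_sizes(records, factors=(0, 1))   # 10 cells
by_all = cell_sizes(records, factors=(0, 1, 2))       # 30 cells
```

Dropping a single record from `records` is enough to make `is_balanced`
return False for the full three-way breakdown.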

In Lazaraton's study evaluating a listening comprehension program (chapter 9,
page 260), comparisons of gain scores in listening comprehension were drawn
between an experimental class with 19 Ss and a control class with 20 Ss. This is
not a balanced design because there are an unequal number of Ss in each cell.
Comparisons were also drawn for visa status (immigrant vs. foreign student) of
Ss in these intact classes. It is likely that this variable was unbalanced as
well. Comparisons for male and female Ss were also made. It is improbable that
this variable was balanced. This is typical of research which involves intact
classes. Designs for such studies are seldom balanced.

If your design does not have equal ns for cells, it is still possible to do the
analysis by hand. However, we would recommend that you consider either dropping
or adding Ss (or observations) so that the cells are equal in size. If you have
a substantial number of cases per cell, it is possible to drop a few cases to
equate cell sizes. This must be done with care. You cannot go through the data
to see which do not fit your hypotheses! The best solution is to delete cases
randomly within cells. That is, each S (or observation) should have an equal
chance of deletion. If you had three groups with 33, 30, and 32 cases, you could
delete three from group 1 and two from group 3 simply by numbering each case and
then randomly selecting numbers for deletion. The outcome should not be
substantially changed if the Ss were randomly selected in the first place. For
intact classes, particularly those with small n sizes, deletion of even one S or
observation could dramatically change the outcome. In such cases, a better
method for equating groups is to add Ss. This can be done by adding data at the
$\bar{X}$. To do this, compute the $\bar{X}$ for the cell with missing data (not
the mean for all the data). Since this will not change the mean for the group,
it will not change the results. It will also be necessary to adjust the "error"
term (which we will talk about later).

Chapter 11. Comparisons Between Three or More Groups

If at all possible, the best solution is to use a computer to run ANOVA. It can
easily accommodate unbalanced data. If that is not possible, consult your
advisor or statistical consultant to be sure that you have used the best method
of equating the n for each group prior to running the statistical procedures.
To calculate an unbalanced design by hand is a very involved procedure. We
recommend you avoid it if at all possible.
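The random-deletion procedure described above (number each case, then pick
cases at random) might be sketched as follows. This is a minimal illustration,
not a recommended tool: the function name `equate_by_random_deletion` and the
case numbers are hypothetical, and the sketch simply cuts every group down to
the size of the smallest one.

```python
import random

def equate_by_random_deletion(groups, seed=None):
    """Randomly drop cases so that every group is cut down to the smallest n.

    `groups` maps a group label to a list of cases. Within a group, every case
    has an equal chance of deletion. A new dict is returned; the input is left
    untouched.
    """
    rng = random.Random(seed)
    target_n = min(len(cases) for cases in groups.values())
    # rng.sample draws without replacement, so the cases that are kept
    # (and therefore the cases that are deleted) are chosen at random.
    return {
        label: rng.sample(cases, target_n)
        for label, cases in groups.items()
    }

# Hypothetical case numbers for three groups with 33, 30, and 32 cases.
groups = {
    "group 1": list(range(1, 34)),
    "group 2": list(range(1, 31)),
    "group 3": list(range(1, 33)),
}
equated = equate_by_random_deletion(groups, seed=1)
sizes = {label: len(cases) for label, cases in equated.items()}
```

Fixing the `seed` makes the selection reproducible, which is useful if you need
to document exactly which cases were deleted.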


Fixed Effects and Random Effects Designs 

There is one final consideration in using ANOVA for group comparisons. Designs
are classified as either random effects or fixed effects designs. A fixed
effects study is one where the levels of a variable have been specified
according to the research question. That is, we might wish to compare the
levels of beginning, intermediate, and advanced learners: three fixed levels of
the variable language learner. This is a "fixed effects" design. In the
analysis, we can test for differences among these fixed levels. In turn, we can
only generalize our findings to these same levels in the population. In a
random effects design, the Ss are randomly selected and the levels are also
randomly set. For example, in large-scale questionnaire research, various
proficiency levels or age levels might be randomly sampled from the population.
In this case, the researcher hopes to be able to generalize not just to set
levels in the population, but across the population at large. Usually a
randomly selected stratified sample has been drawn from that population. If the
data are randomly sampled, the design is a random effects design. Since we
seldom carry out research using a random effects design, the procedures
illustrated in this chapter and the next will be those of fixed effects
designs.


One-way Balanced ANOVA 

The goal of ANOVA is to explain the variance in the dependent variable in terms
of variance in the independent variables. In a one-way design, there is only one
dependent variable and only one independent variable with three or more levels.
The comparisons of the means on the dependent variable are made across the
levels. The levels, however, may be within-groups (a repeated-measures design
where the same Ss do different tasks or the same task at different times) or
between-groups, where each group is composed of different Ss. When we compare
the scores of our students before instruction, and every two weeks thereafter,
the comparison is within-groups. When we compare the final exam scores of
students in five sections of a course, the comparison is between groups. In the
explanations that follow we will be discussing the use of One-way ANOVA with
between-groups design. There are ANOVA procedures for repeated-measures
designs, so by understanding how ANOVA works, you will also understand how
a repeated-measures ANOVA works. We will discuss ANOVA with repeated-measures
in the next chapter.

Assume that you have taught one class level five different times. Each time you
varied the methodology. One time you used the Natural Approach, another time
Total Physical Response, then Silent Way, the Situational/Functional Approach,
and Language Through Drama.

While the methodologies do overlap, you still wonder whether there are
differences on final exam scores following these five approaches. Remember that
there are several threats to the validity and reliability of this study. It is
not an experimental study. The Ss were not randomly selected or randomly
assigned. Whatever the findings, the results cannot be generalized to all
learners at the level you teach. The ANOVA procedure is being used for
descriptive rather than inferential purposes.

The design box looks like this: 

                           Approach Used

            Natural     TPR      Silent    Sit/Func    Drama

Scores      n = 15     n = 15    n = 15     n = 15     n = 15

In displaying the data on the final exam scores, we could place the $\bar{X}$
and the s in each box. We want to be able to reject the $H_0$ which states there
is no difference in final exam scores for the five methodologies. In
statistical terms, the $H_0$ for this study is
$\bar{X}_1 = \bar{X}_2 = \bar{X}_3 = \bar{X}_4 = \bar{X}_5$.

We would, of course, be surprised if each group obtained exactly the same
$\bar{X}$ and s. The question is whether the means are far enough apart that we
can say they are not five sample means drawn from the same population (with an
unknown $\mu$ and $\sigma^2$). To answer this question we will again use sample
statistics to estimate these parameters in the population.

In the example, there are 15 Ss for each of the treatments. It is unlikely that
the mean for each group will be exactly the same. If there are differences
among the groups, this is hopefully due to the treatment (i.e., methodology
used). In addition, there is variability of the data within each group (ANOVA
uses variance rather than standard deviation to examine this distribution of
scores within each group) and this variability within groups will not be
exactly the same for each. The variability within the group is due to
individual differences among Ss rather than to the treatment. This is called
error variability, symbolized as $s^2_{within}$. It is "error" not because it's
wrong in any way but rather because it is not the focus of the research. The
variance is due to something other than the treatment.

Error variability = within-group variance = $s^2_{within}$

This value for within-group variance will be our first estimate of the
population variance, $\sigma^2$. We expect that the variance within each group
is the result of normal distribution of scores (rather than the effect of the
method). The within-group variance is an unbiased estimate of $\sigma^2$
because it does not include the effect of the treatment.

The second estimate is based on the differences across the groups. Each group
will perform differently if the methods are not all equally effective. In ANOVA
this variability among the groups, called $s^2_{between}$, is attributed to two
things:

1. Nonrandom, systematic variation between the groups due to the treatment
effect.

2. Random, unsystematic, or chance variability between groups which is error
variability.

Between-groups variance, then, is:

Error variability + treatment effect = between-groups variance = $s^2_{between}$

$s^2_{between}$ is our second estimate of $\sigma^2$. We said that within-group
variance was not biased by treatment. Between-groups variance is biased by
treatment because it contains variation due to treatment as well as variation
due to error.

If the methods are truly different, we expect the scores for the five groups to
be pushed far apart, so the variance between groups should be large. We expect
the variability within the groups to be relatively stable across each group. In
the best of all worlds, we would expect the error variance (variance due to the
normal distribution of individual scores) to be the same between groups as
within groups.

Since we want to be able to reject the null hypothesis, we hope the
between-groups variance will be large and we want those differences to be due
to treatment rather than to error. If the treatment effect is strong, the
between-groups variance should be large, showing the groups pushed far apart
rather than overlapping. A simple, common-sense understanding of ANOVA is the
comparison of variance between groups and variance within groups. If the
variance between groups (which includes the treatment effect) is not greater
than the variance within groups (which does not include the treatment effect),
then we know the treatments are all similar. We will not be able to reject the
$H_0$. We must consider them all as just similar samples from the same
population.

On the other hand, if the value of between-groups variance is greater than
within-group variance, we are set to go. There is some difference between the
five groups. Now the problem is discovering whether they are different enough
that the difference is not due to chance. So (guess what), we will again place
the difference in a sampling distribution to discover the probability of
finding a difference as large as that we obtained. The sampling distribution
for ANOVA is called the F distribution.

Fortunately for us, mathematicians have already constructed this distribution
(see appendix C, table 5). Like the t distribution, it is made up of families
with the same number of degrees of freedom in the sample size, but, unlike the
t distribution, the F distribution also considers the degrees of freedom for
number of groups. We don't have to consider the degrees of freedom for number
of groups in the t-test because in the t-test there are always only two groups
and the df for groups is always 1.

To discover whether we can reject the null hypothesis, we once again consult the
critical values given in the table for the sampling distribution.

In the t-test, we obtained a t value. In ANOVA, we will obtain an F value. The
F value is the ratio of the two sources of variance: between-groups variance
over within-group variance.

$$F_{obs} = \frac{s^2_{between}}{s^2_{within}}$$

Another way of saying the same thing:

$$F_{obs} = \frac{s^2_{between}}{s^2_{within}} = \frac{\text{error} + \text{treatment}}{\text{error}}$$

If there is no effect for treatment we would end up with a ratio of 1, right?

$$\frac{\text{error} + \text{treatment}}{\text{error}} = \frac{\text{error}}{\text{error}} = 1$$

Thus, we need an F ratio > 1 to show any difference at all among groups. How
much larger than 1 we need depends on the number of degrees of freedom within
the groups and the degrees of freedom for the number of groups.

Let's apply this to the example to see how it works. The data from the exam 
scores of the five different groups follow. 


Exam Scores for Five Methods

            Natural     TPR      Silent    Sit/Func    Drama
Mean           64        60        61         65         57
$s^2$         110        90        95        100        110
n              15        15        15         15         15
df             14        14        14         14         14

Number of groups (K) = 5
Total observations (N) = 75 (15 per group)


Arithmetically, the means of the groups differ. It appears that the
situational-functional approach is best, while the natural approach is second
best. Worst appears to be drama. We cannot, however, be confident that they are
truly different without taking the n size and variance into account.

Look at the following two visuals.

[Two figures: distributions of scores for the five groups, the first with much
overlap among groups, the second with little overlap.]

In the first figure, there is a lot of variability within groups, but not much
between. There is a lot of overlap. These differences probably are not
statistically significant. However, in the second diagram, there is little
variability within groups and much more between. There is not much overlap. And
since statistical significance for ANOVA is determined by the ratio of
variability between to variability within, the second diagram probably shows
differences which are statistically significant. In any case, we cannot know
for sure without calculating the F ratio.


Remember that $\sigma^2$ is used to represent population variance. This value is
estimated by squaring the standard deviation. You know that we must look at the
F ratio, the ratio of variance between over the variance within groups. We won't
do the calculations just yet, but let's assume that we found a between-groups
variance of 264. This represents the total variability present between the
scores of the five groups. The within-group variance turned out to be 95. This
represents the total variability of scores within each of the five groups.

We can, then, compute the F ratio:

$$F = \frac{s_B^2}{s_W^2} = \frac{264}{95} = 2.78$$

The subscripts B and W stand for between and within. The ratio is larger than
1. So, there is some effect for treatment. To discover whether the F value is
large enough to give us confidence in rejecting the $H_0$, turn to the F
distribution in table 5, appendix C. We know the probability level will depend
on the number of df for groups. There are five groups (K). The df for groups is
K - 1 = 4. There are 15 Ss in each group. To find the df for each group we use
n - 1 or 14. There are five groups, so we multiply n - 1 by 5, 14 x 5; the df
for within-group variance is 70.
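The df bookkeeping for a one-way between-groups design is simple enough to
sketch in a few lines. The function name `anova_degrees_of_freedom` is ours,
introduced only for illustration. It takes the n for each group and returns the
df for groups, the df within groups, and the total df.

```python
def anova_degrees_of_freedom(group_sizes):
    """Return (df_between, df_within, df_total) for a one-way between-groups design."""
    k = len(group_sizes)          # number of groups, K
    n_total = sum(group_sizes)    # total observations, N
    df_between = k - 1
    # Each group contributes n - 1; summed over groups this equals N - K.
    df_within = sum(n - 1 for n in group_sizes)
    df_total = n_total - 1
    return df_between, df_within, df_total

# Five groups of 15 Ss, as in the exam score example.
df_between, df_within, df_total = anova_degrees_of_freedom([15] * 5)
```

Because each group's n is passed separately, the same sketch also gives the df
for designs where the groups are not equal in size.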

The numbers across the top of the table are the degrees of freedom for groups.
The numbers down the side of the table are the degrees of freedom for Ss or
number of observations. For our example, look across the top of the table for
the F distribution to column 4 (df for K). Then go down the column to row 70 (df
for the number of observations). Notice that there are two sets of numbers in
each place. The top set is for a critical value where p = .05. The lower set is
for a critical value where p = .01. We should use the .05 critical values of F
because that's what we specified before the analyses. $F_{crit}$ for an .05
level of confidence (for df of 4, 70) is 2.50. We can reject the $H_0$ since
$F_{obs}$ is greater than $F_{crit}$. We can, therefore, claim that scores on
the exam differ across the groups.
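If no F table is at hand, the probability of an F value this large or larger
can be approximated numerically. The sketch below is only an illustration under
stated assumptions: the names `f_pdf` and `f_upper_tail` are ours, the area
under the F density is found with Simpson's rule rather than with a statistical
library, and the sketch is restricted to more than 2 df for groups so that the
density starts at zero.

```python
import math

def f_pdf(x, dfn, dfd):
    """Density of the F distribution; dfn = df for groups, dfd = df for observations."""
    if x <= 0:
        return 0.0  # correct at x = 0 only when dfn > 2, as assumed below
    log_beta = (math.lgamma(dfn / 2) + math.lgamma(dfd / 2)
                - math.lgamma((dfn + dfd) / 2))
    log_pdf = (
        (dfn / 2) * math.log(dfn / dfd)
        + (dfn / 2 - 1) * math.log(x)
        - ((dfn + dfd) / 2) * math.log(1 + dfn * x / dfd)
        - log_beta
    )
    return math.exp(log_pdf)

def f_upper_tail(f_value, dfn, dfd, steps=4000):
    """Approximate P(F >= f_value) by integrating the density with Simpson's rule."""
    if dfn <= 2:
        raise ValueError("this simple sketch assumes dfn > 2")
    if f_value <= 0:
        return 1.0
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = f_value / steps
    total = f_pdf(0.0, dfn, dfd) + f_pdf(f_value, dfn, dfd)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, dfn, dfd)
    area_below = total * h / 3
    return max(0.0, 1.0 - area_below)

# Probability beyond the tabled critical value, and beyond the observed F.
p_at_critical = f_upper_tail(2.50, 4, 70)
p_observed = f_upper_tail(264 / 95, 4, 70)
```

With 4 and 70 df, the area beyond 2.50 should come out close to .05, which
agrees with the tabled critical value. The area beyond the observed F is
smaller still, which is another way of saying that the observed F exceeds the
critical value.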

In this example, the data are from a relatively small number of Ss. Remember
that the smaller the number of Ss (or observations), the larger the differences
required to obtain statistically significant results.

Now that we have gone over the basic concept of ANOVA, let's actually compute
an F ratio for a One-way ANOVA working from raw data.

The first step is to arrange the data in a sensible, organized way. This is
important whether you expect to enter the data directly into the computer or
plan to work from a data table and carry out the procedure by hand. If the data
are organized in the following way, it will be easy for you to find all the
figures you need for hand computation.

Because we don't want to make the hand calculation too cumbersome a chore,
we will limit the number of groups to 5 and the number of Ss within each of
these groups to 10. We decided to use equal numbers of Ss in each group because
the computations are simpler with a balanced design. (Aren't you glad we did?)

Since the number of Ss is relatively small, we also know that the F ratio will
have to be relatively large before we can feel confident in rejecting the
$H_0$.

The table for our research example follows: 


Research hypothesis? There is no effect of first language on vocabulary
test scores (i.e., the means of the five groups will not differ).
Significance level? .05
1- or 2-tailed? 2-tailed
Design

Dependent variable(s)? Vocabulary test scores
Measurement? Scores (interval)

Independent variable(s)? L1 group
Measurement? Nominal (5 levels)

Independent or repeated-measures? Independent groups
Other features? Fixed effects design
Statistical procedure? One-way ANOVA


As the above box shows, the $H_0$ for the study is that there is no difference
in vocabulary scores for different L1 groups. The dependent variable is
vocabulary score. The independent variable is L1. This is a one-way design
because there is one dependent variable and only one independent variable (with
five levels). We assume the test scores are interval data and this is a fixed
effects design. Therefore, we cannot generalize from the study to other
language groups not included in the study.




Vocabulary Test Data

S               Gp1     Gp2     Gp3     Gp4     Gp5
1                16      15      14      14      10
2                14      13      13      10       8
3                10      12      15       9      10
4                13      13      17      11       9
5                12      13      11      11      12
6                20      20      11      12       5
7                20      19      12      10       8
8                23      22      10      13       8
9                19      19      13       9       7
10               18      17      12       8       9

n                10      10      10      10      10
$\Sigma X$      165     163     128     107      86
$\Sigma X^2$   2879    2771    1678    1177     772

N = 50     $\Sigma X = 649$     $\Sigma X^2 = 9277$     $(\Sigma X)^2 = 421201$


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 11.2 

1. Let's be sure that the figures for the row totals ($\Sigma$ rows) are clear.
Row labeled n shows ______.

How many Ss are there in each group? ______.

What is the N for the study? ______.

How many df are there for the number of observations in each group? ______.

The df for Ss, then, is ______.

2. $\Sigma X$ is obtained by ______.

$\Sigma X^2$ is obtained by ______.

(If you are not sure, guess and then use your calculator to check your answer.)

$(\Sigma X)^2$ is obtained by ______.

(Again, if you are not sure, guess and do the calculations to check.)

3. Look at the raw data. Mathematically, which group appears to be best? ______
Worst? ______ Do the differences appear to be large? ______.

How much variability does there appear to be in each group? ______

Do you think we will be able to reject the $H_0$? ______


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

With the data in table form, we can compute the within-group and between-groups
variances fairly easily and then find the F ratio.

There are several ways to compute the F ratio from raw data. We could begin
by finding the grand $\bar{X}$ for all the Ss, then subtracting each individual
score from the grand $\bar{X}$ and squaring all these individual differences
from the $\bar{X}$ as we have before in many of our computations. While this
may make the concept easier to understand (since it's exactly the same kind of
thing we have done in computing standard deviation, z scores and in the t-test
procedure), it's not the easiest computation to do. So let's start with an
easier computation and try the other method later. This is called the sum of
squares method.

Step 1

Square each score and add. This is the first step in determining the sum of
squares total, abbreviated SST in most computer printouts. We will use that
abbreviation here.

$$SST = \sum X^2 - \frac{(\sum X)^2}{N}$$

If you look at the data sheet, you will find that the scores of all the Ss have
been squared and totaled. The $\Sigma X^2$ value is 9277. N is 50.
$(\Sigma X)^2$ is $(649)^2$. Placing these figures in the formula and computing
the value for SST:

$$SST = 9277 - \frac{(649)^2}{50} = 852.98$$


Step 2

Find the sum of squares between groups. This is often abbreviated as SSB (for
sum of squares between), an abbreviation we will use here.

To find SSB, we add up the scores for each group. Square each sum and then
divide by the n for the group ($(\Sigma X_1)^2 \div n_1$). Add these. Then
subtract the total scores squared divided by N ($(\Sigma X)^2 \div N$). You
have this value since you used it in step 1. All the figures are readily
available from the data table.

$$SSB = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \cdots + \frac{(\sum X_K)^2}{n_K} - \frac{(\sum X)^2}{N}$$

Here is the intermediate calculation step.

SSB = (2722.5 + 2656.9 + 1638.4 + 1144.9 + 739.6) - 8424.02
SSB = 478.28

The total sum of squares for all the data (SST) is 852.98. The sum of squares
between groups (SSB) is 478.28.

Step 3

Find the sum of squares within each group. This is labeled as SSW.

Since we know that SST = SSB + SSW, we can easily find this value by
subtracting SSB from SST. What's left must be the value of SSW.

SSW = SST - SSB

We can enter the figures and do the computation for SSW:

SSW = 852.98 - 478.28 = 374.7
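Steps 1 through 3 can be checked with a short script. This is a sketch of the
sum of squares method only, written with names of our own choosing
(`sums_of_squares`, `vocabulary`); the scores are those of the vocabulary test
data table, entered one list per L1 group.

```python
def sums_of_squares(groups):
    """Return (SST, SSB, SSW) using the computational ("raw score") formulas."""
    all_scores = [x for scores in groups for x in scores]
    n_total = len(all_scores)                       # N
    sum_x = sum(all_scores)                         # sum of X
    sum_x_squared = sum(x * x for x in all_scores)  # sum of X squared
    correction = sum_x ** 2 / n_total               # (sum of X) squared over N

    sst = sum_x_squared - correction
    ssb = sum(sum(scores) ** 2 / len(scores) for scores in groups) - correction
    ssw = sst - ssb                                 # since SST = SSB + SSW
    return sst, ssb, ssw

# Vocabulary test data, one list per L1 group (Gp1 to Gp5).
vocabulary = [
    [16, 14, 10, 13, 12, 20, 20, 23, 19, 18],
    [15, 13, 12, 13, 13, 20, 19, 22, 19, 17],
    [14, 13, 15, 17, 11, 11, 12, 10, 13, 12],
    [14, 10, 9, 11, 11, 12, 10, 13, 9, 8],
    [10, 8, 10, 9, 12, 5, 8, 8, 7, 9],
]
sst, ssb, ssw = sums_of_squares(vocabulary)
```

Rounded to two decimal places, the three values should agree with the hand
computation: 852.98, 478.28, and 374.70.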

Step 4

The next step "averages" each of these figures (SSW and SSB) to make them
sensitive to their respective degrees of freedom. The results are the variance
values. To find the variance between groups, divide SSB by K - 1. Let's do that
now.

$$s_B^2 = MSB = \frac{SSB}{K - 1}$$

$$s_B^2 = MSB = \frac{478.28}{4}$$

$$s_B^2 = MSB = 119.57$$

You will notice that $s_B^2$ is also abbreviated as MSB. This stands for mean
squares between. It is often used in computer printouts and in ANOVA tables.
While $s_B^2$ better represents the notion of variance between groups, the MSB
abbreviation appears more often in ANOVA charts in reports. Feel free to use
whichever annotation you prefer.


To find the variance within groups, divide SSW by N - K. Since there were 50
Ss in the study, 10 in each of five groups, the df is 50 - 5 (N - K) or 45.
Let's place these values in the formula for MSW.

$$s_W^2 = MSW = \frac{SSW}{N - K}$$

$$s_W^2 = MSW = \frac{374.7}{45}$$

$$s_W^2 = MSW = 8.33$$


Step 5

Calculate the F ratio. You can annotate this calculation as $s_B^2 / s_W^2$ or
as MSB/MSW.

$$F = \frac{MSB}{MSW}$$

$$F = \frac{119.57}{8.33}$$

$$F = 14.35$$

The F ratio must be larger than 1.0 to show that there is some difference
between the groups. This is the case here: our F ratio greatly exceeds 1.0.

Now we must decide whether we can reject the $H_0$. To do this we use our sample
statistics as estimators of the population parameters. Our MSB and MSW are
the two available estimates of $\sigma^2$. Remember, $\sigma^2$ represents
population variance.

The MSB estimate is biased for treatment. This estimate belongs to a
distribution with 4 df (K - 1 = 5 - 1 = 4). The second estimate of $\sigma^2$
is MSW. It is not biased for treatment since it includes only error variability
within the groups. MSW, in our example, belongs to a distribution with 45 df
(N - K = 50 - 5 = 45). The F distribution is made up of families with special
distributions according to the df for groups and for observations. Since our
study belongs in the family with 4 df for groups and 45 df for observations, we
check the intersection of 4 and 45 on the F distribution chart in appendix C.
Across the top of the chart the columns are labeled for the df of the
numerator. In the F ratio, MSB is the numerator and our df for MSB is 4. Find
that column. The df for the denominator runs down the side of the table and is
the df for MSW. That's 45. There is a 44 and a 46 but no 45. So we select the
line for 44 df. The critical value needed to reject the $H_0$ at the .05 level
of confidence is 2.58. We can safely reject the $H_0$ at the .05 level.

Research reports published in journals often include a special table called a
"source table" to display the ANOVA values. The tables are a summary of the
results. You should make maximum use of this information in reading reports.
One good way to do this is to first turn to the abstract of the study; identify
the question(s), and the dependent and independent variables. Next turn to the
tables and interpret them as best you can without further information. Then,
with your own interpretation in mind, read the complete article and check your
interpretation against that of the author. The following chart gives the
summary information for our data.

Source of variance        SS       df       MS         F
Between groups          478.28      4     119.57     14.35*
Within groups           374.70     45       8.33

*p < .05
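A source table of this kind can be assembled directly from SSB, SSW, K, and N.
The sketch below is illustrative only; the function names `source_table` and
`format_source_table` and the column widths are our own choices, not a standard
reporting format.

```python
def source_table(ssb, ssw, k, n_total):
    """Build the rows of a one-way ANOVA source table from SSB, SSW, K, and N."""
    df_between = k - 1
    df_within = n_total - k
    msb = ssb / df_between
    msw = ssw / df_within
    f_ratio = msb / msw
    return [
        ("Between groups", ssb, df_between, msb, f_ratio),
        ("Within groups", ssw, df_within, msw, None),
    ]

def format_source_table(rows):
    """Render the rows as aligned plain text."""
    lines = [f"{'Source of variance':<20}{'SS':>10}{'df':>6}{'MS':>10}{'F':>8}"]
    for label, ss, df, ms, f_ratio in rows:
        f_text = "" if f_ratio is None else f"{f_ratio:.2f}"
        lines.append(f"{label:<20}{ss:>10.2f}{df:>6d}{ms:>10.2f}{f_text:>8}")
    return "\n".join(lines)

# Values from the vocabulary example: SSB = 478.28, SSW = 374.70, K = 5, N = 50.
rows = source_table(ssb=478.28, ssw=374.70, k=5, n_total=50)
print(format_source_table(rows))
```

Note that the sketch divides by the unrounded MSW, whereas the hand computation
rounded MSW to 8.33 first, so the F it prints may differ from 14.35 in the last
decimal place.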


Some reports include mean squares (MSW and MSB) but not the sum of squares
within and between groups (SSW and SSB). If you think about it, this will make
sense. You could recover SSW and SSB (if you know MSW and MSB and the
df) simply by multiplying MSW by $df_W$ and MSB by $df_B$. Some journals, in
order to save space, ask authors to delete tables and just include the result
of the statistical test. In this case a statement such as the following would
be used: F = 14.35 (4, 45), p < .05.
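As a quick check on that claim, here are the sums of squares for our example
recovered from the mean squares and the df. The small discrepancy for SSW
arises only because MSW was rounded to 8.33 before multiplying.

```latex
\begin{align*}
SSB &= MSB \times df_B = 119.57 \times 4 = 478.28 \\
SSW &= MSW \times df_W = 8.33 \times 45 = 374.85 \approx 374.70
\end{align*}
```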

You might wonder why the probability level is reported as p < .05 when it
actually qualifies easily for p < .01. The reason is that an .05 level was
selected for rejecting the $H_0$. Many fields (and/or journal editors) require
that you state the level for rejecting the $H_0$ and then simply report whether
or not that level was met. If you exceed that, fine, but report it as < .05.
Since computers can calculate the exact probabilities, many researchers (and
journal editors) use the probability specified by the computer program. So it
is quite possible that you will find research where an .05 level has been
selected but where the report will give the actual probability. (In many
studies, you never know what probability was selected to reject the $H_0$; the
researcher simply reports the obtained probability from the computer analysis.)
It is wise to consult your research advisor or the journal editor for their
views on this before deciding how to present information on probability for
your findings.

Since One-way ANOVA is more complex than the t-test, let's go through another
example. Imagine that we hope to gather evidence to show that "story grammar"
is not as universal as has been claimed. We prepare a reading passage which has
the universal story-grammar components: orientation (setting, time, introduction
of characters), the problem, a set of ordered actions addressing the problem,
the solution, and a coda. We add this story with a set of 20 questions as a
"research caboose" to our general English proficiency exam. If story grammar is
universal then Ss from all first language groups should be able to access this
discourse organization to understand and remember story components. Since five
major language groups are represented in the student body that takes the exam,
we can compare their performance according to L1 membership. From Ss taking the
test, we randomly select Ss who scored in the 80th to 90th percentile on the
rest of the examination. This assures us that we have controlled for L2
language proficiency. To simplify the computations here, we have selected five
Ss from each of the five L1 groups. While these five Ss from each group have
been randomly selected from the proficiency range (and we hope, therefore, that
their means and variances will represent those of the population), we recommend
much larger n sizes for each group in real research where inferences are to be
drawn.

Scores on Story Grammar Questions

Ss            Lang.Gp.1   Lang.Gp.2   Lang.Gp.3   Lang.Gp.4   Lang.Gp.5
1                 18          16          18          16          18
2                 18          15          16          13          16
3                 16          14          15          14          15
4                 15          11          15          12          12
5                 11          12          10          12          13

$\Sigma X$        78          68          74          67          74
n                  5           5           5           5           5
Mean             15.6        13.6        14.8        13.4        14.8

N = 25     Grand mean ($\bar{X}_G$) = 14.44

If Ss from these five language backgrounds perform differently, we would be able
to reject the $H_0$ which states there is no difference in story grammar scores
for different L1 groups. This would give us preliminary evidence that for these
language groups the proposed universal story grammar has not been used equally
to promote understanding and retention of story information. Thus, it may not
be universal at all.

Now, to show that we can indeed do a One-way ANOVA by working with the
grand $\bar{X}$ and computing the individual differences from that $\bar{X}$,
we will present this second method of computing ANOVA. It may better illustrate
how ANOVA works.

Step 1

Compute the grand mean by adding all the scores and dividing by N. The grand
mean is symbolized as $\bar{X}_G$. $\bar{X}_G$ for the data is 14.44.

Step 2

Subtract each score from the $\bar{X}_G$ and square the difference. When we add
all these squared individual differences from the $\bar{X}_G$, the result is
SST.

[Figure: Grand Mean = 14.44]

To compute SST, then, the formula says:

$$SST = \sum (X - \bar{X}_G)^2$$

$$SST = (18 - 14.44)^2 + (18 - 14.44)^2 + (16 - 14.44)^2 + \cdots + (13 - 14.44)^2$$

$$SST = 136.16$$

Step 3

Find SSB. To do this, subtract the $\bar{X}_G$ from the average score of each
group ($\bar{X}_K$), squaring these differences as well. Each squared
difference is multiplied by the n for the group, and the products are then
summed. The formula, therefore, is:

$$SSB = \sum n(\bar{X}_K - \bar{X}_G)^2$$

For our data, then, SSB would be:

$$SSB = 5(15.6 - 14.44)^2 + 5(13.6 - 14.44)^2 + 5(14.8 - 14.44)^2 + 5(13.4 - 14.44)^2 + 5(14.8 - 14.44)^2$$

$$SSB = 16.96$$

[Figure: the group means (13.4, 13.6, 14.8, 14.8, 15.6) in relation to the
Grand Mean = 14.44]


Step 4

Now that we have the sum of squares between groups, we can find SSW by
subtracting.

SSW = SST - SSB
SSW = 136.16 - 16.96
SSW = 119.2

Step 5

These values for variability within and between groups must, of course, be
adjusted for their respective degrees of freedom. The df between groups is
K - 1. We have 5 groups, so the df for groups is 4. The degrees of freedom
within the groups is N - K. The N for the study is 25 and the number of groups
is 5, so the df for observations is 20.

Step 6

Compute the F ratio.

$$MSB = \frac{SSB}{df_B} = \frac{16.96}{4} = 4.24$$

$$MSW = \frac{SSW}{df_W} = \frac{119.2}{20} = 5.96$$

$$F = \frac{MSB}{MSW} = \frac{4.24}{5.96} = .711$$
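The six steps of this second method translate almost line for line into code.
The sketch below is an illustration under our own naming
(`anova_by_deviations`, `story_grammar`); it works from deviations about the
grand mean, exactly as in the steps above, using the story grammar scores
entered one list per language group.

```python
def anova_by_deviations(groups):
    """One-way between-groups ANOVA computed from deviations about the grand mean."""
    all_scores = [x for scores in groups for x in scores]
    n_total = len(all_scores)                    # N
    k = len(groups)                              # K
    grand_mean = sum(all_scores) / n_total       # step 1

    # Step 2. SST: squared deviation of every score from the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in all_scores)
    # Step 3. SSB: squared deviation of each group mean from the grand mean,
    # weighted by the n for the group.
    ssb = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups
    )
    ssw = sst - ssb                              # step 4

    msb = ssb / (k - 1)                          # steps 5 and 6
    msw = ssw / (n_total - k)
    return {"grand_mean": grand_mean, "SST": sst, "SSB": ssb, "SSW": ssw,
            "MSB": msb, "MSW": msw, "F": msb / msw}

# Story grammar scores, one list per language group (Lang.Gp.1 to Lang.Gp.5).
story_grammar = [
    [18, 18, 16, 15, 11],
    [16, 15, 14, 11, 12],
    [18, 16, 15, 15, 10],
    [16, 13, 14, 12, 12],
    [18, 16, 15, 12, 13],
]
result = anova_by_deviations(story_grammar)
```

The values in `result` should match the hand computation once they are rounded
to the same number of decimal places.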


Now, let us show you that the two methods of computing the F ratio do obtain
the same results.

$$SST = \sum X^2 - \frac{(\sum X)^2}{N}$$

$$SST = (18^2 + 18^2 + 16^2 + \cdots + 13^2) - \frac{(18 + 18 + 16 + \cdots + 13)^2}{25}$$

$$SST = 136.16$$

$$SSB = \left[\frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \frac{(\sum X_4)^2}{n_4} + \frac{(\sum X_5)^2}{n_5}\right] - \frac{(\sum X)^2}{N}$$

$$SSB = \left[\frac{78^2}{5} + \frac{68^2}{5} + \frac{74^2}{5} + \frac{67^2}{5} + \frac{74^2}{5}\right] - \frac{361^2}{25}$$

$$SSB = 16.96$$

SSW = SST - SSB
SSW = 136.16 - 16.96
SSW = 119.2

Again, the values for SSB and SSW must be adjusted by the number of degrees 
of freedom. 
Step 4

$$MSB = \frac{SSB}{K - 1} = \frac{16.96}{4} = 4.24$$

Step 5

$$MSW = \frac{SSW}{N - K} = \frac{119.2}{20} = 5.96$$

Step 6

And the F ratio is:

$$F = \frac{MSB}{MSW} = \frac{4.24}{5.96} = .711$$


The table for the results (regardless of which method is used to compute the F 
statistic) would look like this 


Story Grammar Scores Across 5 L1 Groups

Source of variance        SS      df     MS      F
Between groups          16.96      4    4.24    .711
Within groups          119.20     20    5.96
Total                  136.16     24


The F ratio is less than 1, so we know there is no meaningful difference among the
five groups. They are all equally able to process and retain information from a
reading passage which conforms to story grammar principles.
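The six steps can also be checked in a few lines of code. The following is only a minimal sketch, not part of the original worked example: the function name `one_way_anova` and its arguments are hypothetical, and because the raw scores are not repeated here it takes SST (136.16) as given and recomputes everything else from the group means and n sizes.

```python
def one_way_anova(group_means, group_ns, sst):
    """Deviation method: SSB from group means, SSW by subtraction, then F.

    Hypothetical helper for illustration; sst is supplied, not recomputed.
    """
    n_total = sum(group_ns)              # N
    k = len(group_means)                 # K, the number of groups
    # Grand mean as the n-weighted mean of the group means
    grand_mean = sum(m * n for m, n in zip(group_means, group_ns)) / n_total
    # SSB = sum of n * (group mean - grand mean)^2
    ssb = sum(n * (m - grand_mean) ** 2 for m, n in zip(group_means, group_ns))
    ssw = sst - ssb                      # SSW = SST - SSB
    msb = ssb / (k - 1)                  # adjust for df between groups
    msw = ssw / (n_total - k)            # adjust for df within groups
    return {"SSB": ssb, "SSW": ssw, "MSB": msb, "MSW": msw, "F": msb / msw}


# Story grammar example: five L1 groups, five Ss per group
result = one_way_anova([15.6, 13.6, 14.8, 13.4, 14.8], [5, 5, 5, 5, 5], 136.16)
for name, value in result.items():
    print(name, round(value, 3))
```

With the story grammar figures this reproduces the values in the summary table (SSB 16.96, SSW 119.2, MSB 4.24, MSW 5.96, F about .711).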


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 


Practice 11.3 


1. Can we say that these Ss from these language groups use a universal story
grammar to process and retain the information? Why (not)? _


2. How might we design a better study to find the answer to this question?


► 3. Imagine your ESL program decided to find out "once and for all" the best
method for teaching composition. Three methods were devised: method 1, 2, and
3. Ss who were required to take the regular composition course were randomly
assigned to one of the three methods. Ss were given the same composition topics.
At the end of the quarter, the final compositions were graded by five raters and
averaged. The scores obtained were:


Average Final Composition Score

Method 1    Method 2    Method 3
   23          12          19
   20          14          18
   19          19          10
   21          16          18
   15          12          18
   19          15          20
   23          18          19
   24          11          16
   23          15          20
   20          17          21


Fill in the box diagram below for this study:


Research hypothesis? _

Significance level? _

1- or 2-tailed? _

Design 

Dependent variable(s)? _ 

Measurement? _ 

Independent variable(s)? _ 

Measurement? _ 

Independent or repeated-measures? 

Other features? _ 

Statistical procedure? _ 


Using either the sum of squares or the deviation method, do an ANOVA. Show 
your calculations below. 


Can you reject the $H_0$? What do you conclude?

4. We have given you three examples of One-way ANOVA in this chapter. In
one case we were able to reject the $H_0$ and in the other two we could not. From
an examination of the $\bar{X}$ for each group in these studies, could you have easily
predicted which would and which would not allow us to reject the $H_0$? Why
(not)? _


ooooooooooooooooooooooooooooooooooooo 


Assumptions Underlying ANOVA 

We have touched on some of the assumptions underlying ANOVA in a general
way. Just to put it all together, the assumptions of One-way ANOVA are:

1. There is one dependent variable and one independent variable with three or
more levels. In the story grammar example (page 322), the independent
variable has five levels (five different L1 groups).

2. The data are score or ordinal scale data that are continuous. In the story
grammar example, we assume the test items yield scores that are continuous,
fairly equal-interval in nature, so that $\bar{X}$ and variance are the best measures
of central tendency and variability for the data.

3. The data are independent (not repeated-measures). In the story grammar
example, each S contributes one score to only one group. The comparison is
between groups.

4. There is a normal distribution of scores in each group. That is, $\bar{X}$ and
variance are the best descriptions of the data. We used the absolute minimum
(five per cell) in this example. If you plot out the actual scores, the polygon
is not that of a normal distribution. Even though ANOVA is said to be fairly
robust in this regard, we recommend a much larger n size to guarantee that
the sample distributions are not skewed.

5. ANOVA assumes that the data in the respective populations from which the
samples are drawn are normally distributed. Therefore, we have an
interpretation problem similar to that discussed for the t-test (chapter 9). It is
legitimate to compare native speakers with several different NNS groups.
However, care must be taken in interpreting the results. That is, large
differences can be expected, but the differences may be due to language
proficiency rather than to the variable being tested. In this type of design, it is
impossible to randomly assign Ss to either native speaker or nonnative speaker
groups, and so no causal claims can be made in such research.

6. There are equal variances of scores in each group. We assume this is the case
since the design is balanced. If we use a computer program for the analysis,
the printout will give us this information.
7. The design is balanced (if calculations are done by hand). In the example, the
dependent variable was the test score and the independent variable was first
language, a variable with five levels of equal n sizes.

8. There is a minimum of 5 observations per cell. Some statisticians recommend
a minimum of 10 observations per cell. It is often difficult to obtain a normal
distribution with a small n size. That is, apparent differences within a cell
may be due to a real difference or just to one highly variable observation
which distorts the cell mean. In addition, with a small number of
observations, the power to reject $H_0$ is low, so the F must be very large to obtain
statistical significance.

9. The F statistic allows us to reject or accept the null hypothesis. If there is a
statistically significant difference in the groups, it does not pinpoint precisely
where that difference lies. (We will deal with this next.)

(You might find it useful to review these assumptions by applying them to the test
of composition methodology, page 326. Check to see if all assumptions were met
in that example.)

Let's look back at the example where a statistically significant difference was
found between the groups. This is the comparison of performance of five
different L1 groups on a vocabulary test. We know that the groups did not perform
in the same way on the test because we could reject the $H_0$ at the .05 level.
However, we still do not know exactly where the difference lies. We cannot
simply check the $\bar{X}$ for each group and decide.

To discover the precise location of differences, we need to perform a post hoc
comparison of the means. If no difference among the means was found in the
ANOVA procedure, you can stop at that point. That is, it is not appropriate to
search for differences among sets of means unless the F ratio is significant. When
there is a significant difference among the means and we need to identify precisely
where that difference lies, a post hoc comparison is possible.


Tests for Locating Differences Among Means

There are two ways to precisely locate differences among means. The first is by
planning the comparison ahead of time. In this case there are preplanned
comparisons and hypotheses to test. For example, in the vocabulary example (page
318) we may believe that language groups 1 and 2 will perform quite differently
than Ss in language groups 3, 4, and 5. This belief would be built on previous
research or on strong theoretical arguments. Perhaps previous research has
shown that the first two groups share many cognate vocabulary items with
English and that these two groups usually perform very well on English vocabulary
tests. The preplanned comparison, then, builds on previous work and the
hypothesis is directional (groups 1 and 2 will do better than the other three). The
$H_0$ for the comparison would be that there is no difference in vocabulary scores
for Ss whose languages are either +cognate or -cognate. The $H_1$ for a
preplanned, directional comparison allows us to use a one-tailed hypothesis.
The second way to locate differences is a post hoc comparison of means. In this
case the researcher has an $H_0$ of no difference among the five groups. After
rejecting this $H_0$, it is still not clear exactly where the difference lies, so exploratory
comparisons are made between all the different groups or between some groups
selected on a post hoc basis. The analyses will be two-tailed tests.

The two methods are sometimes called a priori for the preplanned comparisons
and post hoc for those carried out after the fact. There is a wealth of computer
procedures that make these comparisons. Some of them are best used to compare
each group with every other group. Others, such as the Scheffe, allow very
powerful testing of grouped means against other grouped means.

We will not present the computations for the Scheffe, the Duncan, the Tukey, or
other multiple-range tests here since they are very complex. See Kirk (1982) for
a discussion of a priori and post hoc comparisons using these tests (and for the
formulas for computation). Most computer software programs will generate these
on request. We will, however, present a table here to show the output from one
such test, the Scheffe.


Scheffe Test of Differences Across Language Groups

Group    Group 1        Group 2        Group 3        Group 4        Group 5
         Mean = 4.05    Mean = 3.30    Mean = 2.90    Mean = 2.75
1                       1.70**         1.75**         2.10**         2.25**
2                                      1.45**         1.50**         2.00**
3                                                     1.15**         1.30**
4                                                                     .40

**p < .01


The chart shows which groups differ from each other. Notice the group number
for the column and also for the rows. The value 1.70 compares group 1 and
group 2 (it is at the intersection of these two groups on the table). The 1.75 is for
group 1 and group 3. The 1.45 is for group 2 and group 3, and so forth. Each
comparison is significant with the exception of the comparison between group 4
and group 5. It is also possible to group variables (e.g., groups 1 + 2 vs. groups
3 + 4 + 5) for comparison using the Scheffe test.

The combination of a One-way ANOVA and a Scheffe test allows us to discover
whether there are differences among the $\bar{X}$s of different groups, and it also allows
us to see, in a post hoc comparison, exactly where the differences lie.


Strength of Association: omega²

There is another very nice feature that you can add to a One-way ANOVA that,
like $\eta^2$ for the t-test, will allow you to think about the strength of association. It
will let you talk about the strength of the association in the data for balanced
designs. That is, you can determine the proportion of the variability in the
dependent variable that can be accounted for by the independent variable. This is
called $\omega^2$ (omega squared). The formula is very simple and you can easily do it
by hand.

$$\omega^2 = \frac{SSB - (K - 1)MSW}{SST + MSW}$$

Let's try this for our vocabulary test example.

$$\omega^2 = \frac{478.28 - (4)(8.33)}{852.98 + 8.33}$$

$$\omega^2 = \frac{444.96}{861.31}$$

$$\omega^2 = .52$$

This tells us that we can account for 52% of the variability in the vocabulary
score simply by knowing the L1 group. The strength of association of the
independent variable is, thus, quite strong. As you will see when we turn to
correlation, a relation in the .40 to .60 range is quite impressive in dealing with this
type of data. However, it leaves us wondering how we might account for the
remaining 48% of the variance!

(The omega² formula can only be used for balanced designs. If the design is not
balanced, you can use the eta² formula which is presented in chapter 12.)
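As a quick check on the hand calculation, the omega² formula can be written as a one-line function. This is only an illustrative sketch; the name `omega_squared` and its parameter names are hypothetical, and the inputs are the vocabulary test values quoted above.

```python
def omega_squared(ssb, sst, msw, k):
    """Strength of association for a balanced One-way ANOVA.

    omega^2 = (SSB - (K - 1) * MSW) / (SST + MSW)
    """
    return (ssb - (k - 1) * msw) / (sst + msw)


# Vocabulary test example: SSB = 478.28, SST = 852.98, MSW = 8.33, K = 5 groups
w2 = omega_squared(ssb=478.28, sst=852.98, msw=8.33, k=5)
print(round(w2, 2))
```

The function does not check that the design is balanced; that remains the user's responsibility, as the note above explains.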

Once you have found a statistical difference in a One-way ANOVA, you can
reject the $H_0$. A multiple-range test, such as Duncan or Scheffe, will show you
precisely which means differ from each other. And, finally, once you have
rejected the $H_0$, you can also discuss the strength of the relationship using $\omega^2$.

ooooooooooooooooooooooooooooooooooooo 

Practice 11.4 

► 1. Calculate and interpret omega² for the methodology problem on page 315.
You will have to derive the SS from the MS and df.


► 2. Calculate and interpret omega² for the composition problem in practice 11.3.


ooooooooooooooooooooooooooooooooooooo
Nonparametric Tests for Comparing Three or More Groups

When the assumptions of ANOVA cannot be met, there are still nonparametric
tests that can be used. The nonparametric equivalent to the one-way
between-groups ANOVA is the Kruskal-Wallis test.

It is often the case that researchers in our field cannot meet the assumptions of
ANOVA. It is true that ANOVA is quite robust to violations of its assumption
of normal distribution. Often, though, we work with intact groups. Ss are not
randomly selected, and so we cannot be sure that our estimates of population
variance, $\sigma^2$, are accurate. In addition, we may not be content with the "interval"
nature of the data. In such cases, a nonparametric comparison of the data seems
more appropriate than does ANOVA.


Kruskal-Wallis Test 

There is one caution to keep in mind regarding the Kruskal-Wallis test. If you
have three levels and the n of any level is 5 or less, use an "Exact test" (see Wike,
1971) instead.

Let's begin with an example and work through the procedure. Imagine that you
wished to test Schumann's (1978) acculturation hypothesis. From all the adult,
immigrant, untutored learners of ESL in your community, you have found seven
people who can be classified as "acrolangs" (Schumann's terminology for learners
who have fossilized in syntactic development but at a fairly high level), eight
"mesolangs" (roughly those who have fossilized at a medium level of syntax
development), and nine "basilangs" (learners who fossilized early in development of
syntax). You have revised and improved a test which measures degree of
acculturation. Some of these items are on 5-point scales, some are yes/no
questions, some ask the learners to give opinions. Before we turn to the
calculations, please complete the following practice as a review that highlights when a
nonparametric procedure such as Kruskal-Wallis should be used.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 11.5 

1. Is it likely that the data are equal-interval? _

Has random selection of Ss been used? _

Would the $\bar{X}$s and variances obtained for each group allow you to make an
accurate prediction of population variance? _

Is the design orthogonal (i.e., is it a balanced design)? _ (Fortunately,
in Kruskal-Wallis unbalanced designs do not complicate computation.)

2. There is one dependent variable: _. There is one independent
variable: _ with three levels (_, _, _). Is the comparison between-groups or
repeated-measures? _

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

The data for the study began as scores on the acculturation measure. These have
been changed to ranks as shown below. The lowest score has been given a rank
of 1, the next a 2, and so forth. The highest score is ranked 24th.


Score    Rank      Score    Rank      Score    Rank
 18       3.5       12       1         18       3.5
 28       8         16       2         21       5
 32       9         37      11         26       6.5
 46      13.5       40      12         26       6.5
 52      16.5       46      13.5       33      10
 62      20         52      16.5       51      15
 63      21.5       61      19         53      18
                    63      21.5       68      23
                                       70      24

n1 = 7             n2 = 8             n3 = 9
T1 = 92            T2 = 96.5          T3 = 111.5
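The conversion from scores to ranks, including the averaged ranks given to tied scores (for example, the two scores of 18 share ranks 3 and 4 and so each receive 3.5), can be sketched as follows. This is an illustration only; the helper name `average_ranks` is hypothetical and the three lists simply restate the score columns of the table.

```python
def average_ranks(scores):
    """Map each score to its rank (1 = lowest); tied scores share the mean rank."""
    positions = {}
    for position, score in enumerate(sorted(scores), start=1):
        positions.setdefault(score, []).append(position)
    return {score: sum(p) / len(p) for score, p in positions.items()}


# Acculturation scores for the three groups (n = 7, 8, and 9)
group1 = [18, 28, 32, 46, 52, 62, 63]
group2 = [12, 16, 37, 40, 46, 52, 61, 63]
group3 = [18, 21, 26, 26, 33, 51, 53, 68, 70]

# Rank all 24 scores together, then total the ranks within each group
rank_of = average_ranks(group1 + group2 + group3)
rank_sums = [sum(rank_of[s] for s in g) for g in (group1, group2, group3)]
print(rank_sums)
```

The three totals printed correspond to T1, T2, and T3 in the table, and together they add up to N(N + 1)/2 = 300, which is a handy check on hand ranking.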


The formula for Kruskal-Wallis follows. The numbers 12 and 3 are constants in
the formula. That is, they do not come from the data but are part of the formula.

$$H = \frac{12}{N(N+1)} \sum_{A=1}^{a} \frac{T_A^2}{n_A} - 3(N+1)$$

In this formula, a = number of levels.
To use the formula, we will go through the following steps.

Step 1

Divide 12 by N(N + 1). The N for our data is 24.

$$\frac{12}{N(N+1)} = \frac{12}{24(25)} = \frac{12}{600} = .02$$

Step 2

The $\sum$ symbol tells us that the next operation is to be done for however
many groups (a = group) we have. The following operation says to square the
sum of ranks for each group and divide by the n for the group. So the total
instruction for step 2 is: square each sum and divide by the n for the group; then
total these.

$$\sum_{A=1}^{a} \frac{T_A^2}{n_A} = \frac{92^2}{7} + \frac{96.5^2}{8} + \frac{111.5^2}{9}$$

$$= \frac{8464}{7} + \frac{9312.25}{8} + \frac{12432.25}{9}$$

$$= 1209.14 + 1164.03 + 1381.36 = 3754.53$$


Step 3

The formula says to multiply the result of step 1 and step 2.

$$\frac{12}{N(N+1)} \sum_{A=1}^{a} \frac{T_A^2}{n_A} = .02 \times 3754.53 = 75.09$$

Step 4

Multiply 3 x (N + 1).

$$3(N + 1) = 3(25) = 75$$

Step 5

Subtract the result of step 4 from step 3. If you look at the formula, you will see
that everything up to the 3(N + 1) was completed in step 3 and that 3(N + 1) was
computed in step 4.

$$H = \frac{12}{N(N+1)} \sum_{A=1}^{a} \frac{T_A^2}{n_A} - 3(N+1)$$

$$H = .02 \times 3754.53 - 75$$

$$H = 75.09 - 75$$

$$H = .09$$

There is still one step left: that of placing the value of H in a distribution. The
distribution that we will use is that of a related statistical procedure, Chi-square
(table 7 in appendix C). If you consult that table, you will see that the degrees of
freedom are listed down the left side of the table. The degrees of freedom for
Kruskal-Wallis are computed using the number of levels minus 1. In this study,
the number of levels is 3, so the df = 2. Across the top of the table you will find
the probability levels. Since we have selected an alpha of .05 to reject the null
hypothesis, we consult that column. The intersection of .05 and 2 df is 5.991.
We need to meet or exceed this critical value, 5.991, in order to reject the null
hypothesis.

We cannot reject the $H_0$. Our H value is far below the critical value needed, 5.99.

The conclusion the researcher could draw from these results is that Ss at differing
levels did not show significantly different scores on the acculturation measure.
If the observed level of H had been, say, 12.40, the researcher could have
concluded that the Ss at different levels did get different scores on the measure.
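The five steps and the final comparison with the critical value can be collected into one short function. The sketch below is illustrative only: the name `kruskal_wallis_h` is hypothetical, it follows the uncorrected formula used in this chapter (no adjustment for tied ranks), and the critical value 5.991 is simply the tabled Chi-square value for .05 and 2 df quoted above.

```python
def kruskal_wallis_h(rank_sums, group_ns):
    """H = 12 / (N(N + 1)) * sum(T^2 / n) - 3(N + 1), with no tie correction."""
    n_total = sum(group_ns)                               # N
    step1 = 12 / (n_total * (n_total + 1))                # step 1
    step2 = sum(t ** 2 / n for t, n in zip(rank_sums, group_ns))  # step 2
    step3 = step1 * step2                                 # step 3
    step4 = 3 * (n_total + 1)                             # step 4
    return step3 - step4                                  # step 5


# Acculturation example: sums of ranks and n for the three groups
h = kruskal_wallis_h([92, 96.5, 111.5], [7, 8, 9])

CRITICAL_VALUE = 5.991  # tabled Chi-square value for alpha = .05, df = 2
print(round(h, 2), "reject H0" if h >= CRITICAL_VALUE else "cannot reject H0")
```

Carrying more decimal places than the hand calculation gives H of about .091 rather than .09, which is why a computer printout may differ slightly from hand-computed values.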
Here is the information given for the data using SPSS-X. Notice that the printout
also gives the mean ranks for each group. This is useful information that you
might want to give in a report.


ACCULT      ACCULTURATION
BY LEVEL    PROFIC LEVEL

LEVEL             1        2        3
NUMBER            7        8        9
MEAN RANKS    13.14    12.06    12.39

CASES    CHI-SQUARE    SIG
   24          .091   .956


In a research article, the statistical findings might simply be summarized as: No
difference was found among the three groups on the acculturation measure
(Kruskal-Wallis $\chi^2$ = .091, p = n.s.). You might report the mean ranks, but
not interpret them. That is, the mean ranks of 13.14, 12.06, and 12.39 differ
numerically, but they do not differ statistically. Therefore, it would be incorrect to
say that these three particular groups differ in any way on the acculturation
measure.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 11.6 

1. How would you interpret a significance level of .956?


► 2. Now that you have a model from which to work, try to use the formula with
the following data. Riggenbach and Lazaraton (1987) rated ESL students at four
levels on their oral discourse skills. Five skills were tested, so the score represents
a composite of five subtests. The total score, thus, is a composite of scale and
score data. The students are from intact classes. The research question is
whether students at different class levels (as determined by their scores on the
ESL Placement Test) differ in oral communication skills. Fill in the information
below regarding the study:


Research hypothesis? _ 

Significance level? _ 

1 - or 2-tailed? _ 

Design 

Dependent variable(s)? _ 

Measurement? _ 

Independent variable(s)? _ 

Measurement? _ 

Independent or repeated-measures? _ 

Other features? _ 

Statistical procedure? _ 
     Level 1             Level 2             Level 3             Level 4
Mean Score  Rank    Mean Score  Rank    Mean Score  Rank    Mean Score  Rank
   21.5      8.5       30.5     24.5       16.5      4         35.5     31.5
   26.5     16         28       19.5       35.5     31.5       30       23
   32.5     27.5       30.5     24.5       28       19.5       29.5     22
   15.5      2         31       26         24.5     13         27       17
   18        6         18.5      7         29       21         32.5     27.5
   23.5     11.5       23.5     11.5       16        3         25.5     15
   23       10         17        5         34.5     30
   10        1         25       14         27.5     18
   21.5      8.5                           34       29

n1 = 9              n2 = 8              n3 = 9              n4 = 6
T1 = _              T2 = _              T3 = _              T4 = _


Fill in the blanks for the totals above and then finish the computations using the 
formula below. In this procedure, it is important to carry out all calculations to 
four or five decimal places. (If you don't, your answer won't agree with the key!) 


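$$H = \frac{12}{N(N+1)} \sum_{A=1}^{a} \frac{T_A^2}{n_A} - 3(N+1)$$

Complete step 1:

$$\frac{12}{N(N+1)} = $$

Complete step 2:

$$\sum_{A=1}^{a} \frac{T_A^2}{n_A} = $$

Complete step 3:

$$\frac{12}{N(N+1)} \sum_{A=1}^{a} \frac{T_A^2}{n_A} = $$

Complete step 4:

$$3(N+1) = $$

Find the value of H.

$$H = \frac{12}{N(N+1)} \sum_{A=1}^{a} \frac{T_A^2}{n_A} - 3(N+1) = $$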
How many degrees of freedom are there in this study? _. Check the
observed value of H against that shown in the Chi-square distribution table (table
7, appendix C). What is the critical value required to reject the $H_0$? _.

Can you reject the $H_0$? _

What conclusion can you draw regarding this study?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

As with the t-test and One-way ANOVA, we can apply a strength of association
test to the value of H in Kruskal-Wallis as well. The values needed for the
formula are readily available in your printout (or hand calculations).

$$\eta^2 = \frac{H}{N - 1}$$


(Pretty complicated, huh?)

.OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 11.7

1. Imagine that in the test of the acculturation hypothesis, the value of H had
turned out to be 20.09 instead of .09. This result would be significant. Using the
above eta² formula, comment on the strength of association connected with
H = 20.09.


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Nonparametric Tests for Locating Differences Among Means

The Kruskal-Wallis test, like ANOVA, allows us to determine whether there are
significant differences between groups. When significant differences are found,
we need other tests to determine the precise location of the differences. The Ryan
procedure is used as a follow-up when overall differences are found using
Kruskal-Wallis. We present the formula for the procedure here since there is only
one nonparametric range test and because it may not be readily available in other
references.


As with Scheffe or other multiple-range tests used with One-way ANOVA, it is
not appropriate to use a Ryan procedure unless there is a significant difference
among the groups compared in the Kruskal-Wallis.
Ryan Procedure for Ordered Data


The way the Ryan procedure works is to order the medians of the groups. Then
the procedure computes a series of Rank sums tests, testing them against a tabled
value of Z whose value is determined by three things: (1) the selected $\alpha$ for
rejecting the $H_0$, (2) the number of groups (i.e., the number of levels of the
independent variable) designated as "a," and (3) "d," which will be explained in a
moment. If the value for the Ryan procedure turns out to exceed the table value
for any of the differences between individual groups, the difference is significant
at the .05 level.

As an example, let's assume that you decided to replicate the Riggenbach and
Lazaraton study the following quarter. You revised the tasks which made up the
composite score and feel that you may be able to show differences across the four
levels. You run the Kruskal-Wallis and find that, indeed, a significant difference
does exist across the levels. The question, now, is to locate precisely where the
difference(s) lies. To do this, you apply the Ryan procedure. The procedure
begins by assigning a rank-order number to the scores. The individual with the
lowest score receives a 1, the next lowest a 2, and so forth. Here are the fictitious
data in rank orders.


          Level 1    Level 2    Level 3    Level 4
             1          2          4         13
             3          8         17         17
             5          9         19         21
             6         12         22         23
             7         18         26         25
            10         20         29         32
            11         24         30         35
            14         28         34         38
            15         31         36         39
            16         33         37         40

Total       88        185        254        283
Median       8.5       19         27.5       28.5


Now let's work through the steps needed to locate the significant differences. 


Step 1

First, in the column labeled a in table 6, appendix C, find the number of groups
for your study. Then, for the selected $\alpha$ (the level of significance you have
chosen), find the value of Z for each value of d - 1. The symbol d stands for the
number of ordered levels spanned in the comparison. Since we have four groups
we want first to compare the lowest and highest group. Level 1 has the lowest
score and level 4 the highest. The span compared is four levels. So d = 4 and
d - 1 = 3. The critical level of Z for 3 df in table 6 is 2.64. When we compare
level 1 and level 3, the span is 3 levels, so d - 1 = 2 and the critical level of Z for
2 df is 2.50. Comparing the lowest level 1 with level 2 is a span of 2, so
d - 1 = 1, and the critical value of Z is 2.24. These critical values work for any
data with the specified number of groups.


If there are four or more groups, we can prepare a special table that will let us
summarize the comparisons. To do this, compute the median for each group and
arrange them in terms of increasing magnitude of the median both across the top
and down the side of the table.

               Gp 1 (8.5)    Gp 2 (19)    Gp 3 (27.5)    Gp 4 (28.5)
Gp 1 (8.5)                       X             X              X
Gp 2 (19)                                      X              X
Gp 3 (27.5)                                                   X
Gp 4 (28.5)


Compute a Rank sums test for each pair of medians indicated on the table (each
X). If you have forgotten how to do this, review chapter 10. Since we are
comparing two groups each time, we use only the data from the two groups being
compared. We start with the two ranks that are the most different: level 1 and
level 4. First, rerank the two levels in relation to each other. The resulting ranks
will be from 1 to 20 since only these two are being compared.

Level 1 has ranks 1, 2, 3, 4, 5, 6, 7, 9, 10, 11.

Level 4 has ranks 8, 12, 13, 14, 15, 16, 17, 18, 19, 20.

The total of ranks for level 1 is 58. This is $T_1$, the total for the lowest level.

The total of ranks for level 4 is 152.

The Rank sums test for this comparison is:

$$Z = \frac{2T_1 - n_1(N + 1)}{\sqrt{\dfrac{n_1 n_2 (N + 1)}{3}}}$$

$$Z = \frac{2(58) - 10(21)}{\sqrt{\dfrac{(10)(10)(21)}{3}}} = \frac{-94}{26.46} = -3.55$$

The 2 and 3 in the formula are constants (i.e., they are part of the formula; they
do not come from the data). The absolute value of Z is greater than 2.64 (for
$\alpha$ = .05, d - 1 = 3). Therefore, we can conclude that levels 1 and 4 differ
significantly from each other. The scores for group 4 are significantly higher than
those of group 1.

Now, let's compare level 1 and level 3. Levels 1 and 3 are the next most different
across the range. Again, rerank the data for level 1 and level 3 in relation to each
other.

Level 1 has ranks of 1, 2, 4, 5, 6, 7, 8, 9, 10, 11.

Level 3 has ranks of 3, 12, 13, 14, 15, 16, 17, 18, 19, 20.

Total of ranks for level 1 = 63.

Total of ranks for level 3 = 147.

Now we can compute the Rank sums test for this data. Make sure you
understand where each number in the formula comes from.

$$Z = \frac{2(63) - 10(21)}{\sqrt{\dfrac{(10)(10)(21)}{3}}} = \frac{-84}{26.46} = -3.17$$

The value of Z critical for $\alpha$ = .05 and d - 1 = 2 is 2.50. The absolute value of
Z exceeds it, so we can reject the null hypothesis of no difference between the
medians of these two groups. We can conclude that levels 1 and 3 are
significantly different. The scores in level 3 are significantly higher than those in
level 1.
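The reranking and the Rank sums test for one pair of levels can be sketched in code as follows. This is a minimal illustration rather than a full Ryan procedure: the helper names `rerank` and `rank_sums_z` are hypothetical, the reranking assumes no tied values between the two levels being compared (true for the two pairs shown), and the critical values 2.64 and 2.50 are the tabled values quoted in the text.

```python
import math


def rerank(low_level, high_level):
    """Rerank two levels against each other only (assumes no ties in the pair)."""
    combined = sorted(low_level + high_level)
    new_rank = {value: i for i, value in enumerate(combined, start=1)}
    return [new_rank[v] for v in low_level], [new_rank[v] for v in high_level]


def rank_sums_z(ranks_low, ranks_high):
    """Z = (2*T1 - n1*(N + 1)) / sqrt(n1 * n2 * (N + 1) / 3)."""
    n1, n2 = len(ranks_low), len(ranks_high)
    n_total = n1 + n2
    t1 = sum(ranks_low)  # total of ranks for the lower level
    return (2 * t1 - n1 * (n_total + 1)) / math.sqrt(n1 * n2 * (n_total + 1) / 3)


# Overall ranks for three of the four levels, as in the fictitious data
level1 = [1, 3, 5, 6, 7, 10, 11, 14, 15, 16]
level3 = [4, 17, 19, 22, 26, 29, 30, 34, 36, 37]
level4 = [13, 17, 21, 23, 25, 32, 35, 38, 39, 40]

z_1_vs_4 = rank_sums_z(*rerank(level1, level4))  # span d = 4, critical Z = 2.64
z_1_vs_3 = rank_sums_z(*rerank(level1, level3))  # span d = 3, critical Z = 2.50
print(round(z_1_vs_4, 2), abs(z_1_vs_4) > 2.64)
print(round(z_1_vs_3, 2), abs(z_1_vs_3) > 2.50)
```

Z comes out negative here because the total of ranks for the lower level is entered as T1; it is the size of Z, not its sign, that is compared with the tabled value.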


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 11.8

► 1. The Z critical for the level 1 vs. level 2 comparison is 2.24. Compute Z.
Can you reject the null hypothesis of no difference between level 1 and level 2?
Why (not)? What conclusion can you draw? _


► 2. Do the comparisons between levels 2 and 3, 2 and 4, and 3 and 4 for the
data above. Interpret your findings. _


ooooooooooooooooooooooooooooooooooooo 

To review, there are a variety of multiple-range tests that can be applied when a 
One-way ANOVA is significant that will allow us to pinpoint precisely where the 
differences are. The Ryan procedure can be used for this same purpose when a 
nonparametric Kruskal-Wallis shows significant differences in comparing the 
performance of groups. 
Activities


1. In the H. Riggenbach & A. Lazaraton study (1987. A task-based approach
to oral proficiency testing. Unpublished UCLA ESL Service Course research
report, UCLA.) the researchers also included an imitation task designed by
Henning (1981). In this test, the Ss were asked to repeat longer and longer
utterances. A possible score of 70 could be obtained on the test. Ss from three
different levels (low intermediate, high intermediate, and advanced) took the test.
Here are the $\bar{X}$s and s.d.s for the three groups:


Group    Count    Mean       Std.Dev.
Gp 1       17     45.9706    13.4812
Gp 2       17     52.1176    11.6827
Gp 3       17     59.7941    13.5314
Total      51     52.6275    12.8984


From this information, what statistical procedure would you use? Why? 


Here is the ANOVA printout from this study:


Source         D.F.    SS           MS          F ratio    Prob.
Between-Gps      2     1630.8922    815.4461    5.2022     .0090
Within-Gps      48     7524.0294    156.7506
Total           50     9154.9216


Scheffe Post Hoc Comparisons Table

Groups    Mean      Mean
1 vs 3    59.794    45.971*
1 vs 2    59.794    52.118
2 vs 3    52.118    45.971

* p < .05


Interpret the tables. What conclusions could you legitimately draw from the
analysis? If you were asked to review this study for Language Learning, what
suggestions would you make to the authors? Explain why your options would
improve the study.

2. M. Ghadessy (1988. Testing the perception of the paralinguistic features of
spoken English. IRAL, 26, 1, 52-61.) gave a listening test to 307 Ss at the
National University of Singapore. The students listened to a voice saying a
statement such as "Come up and see me sometime" as they read the statement. To test
their perception of the suprasegmental features overlaying the spoken utterance,
Ss then selected one of four possible completions such as (a) she said gratingly,
(b) she said huskily, (c) shrieked the miserable woman, (d) said Mary sternly. A
One-way ANOVA was done to see if L1 group made a difference in perception
of paralinguistic features. The means of the five groups were:

Group      Mean
Chinese    11.50
Malay      11.76
Tamil      12.14
English    12.21
Other      12.50

The difference in means for the groups was not significant at the .05 level. That
is, the groups did not perform differently. Imagine that you believe that Ss from
all language groups are equally able to interpret the meanings of speakers'
paralinguistic features. Therefore, you are not surprised at this lack of difference
among the various L1 groups. To test this hypothesis, you might wish to replicate
this study but with Ss from other language backgrounds. How might you design
the study in such a way that it would be possible to combine your results with
that of the 307 Ss in this study? If you were to replicate the study but could not
obtain access to the tapes used in the experiment, what changes might you make
in redesigning the study?

3. L. Goldstein (1987. Standard English: the only target for nonnative speakers
of English? TESOL Quarterly, 21, 3, 417-436.) examined the implicit unstated
assumption of much of SLA research that there is a homogeneous,
standard-English-speaking target language group which nonnative speakers come in
contact with. In particular, Goldstein looked at the language use and attitudes of
28 Hispanic boys who were exposed to Black English in the New York City area.
Data were collected via interviews and test measures. Linguistic data came from
the former; two linguistic variables were isolated: negative concord within the
clause (e.g., "I don't have none") and distributive be (e.g., "I be here"). Negative
concord was quantified as the number of times it occurred divided by the number
of times it could have occurred while distributive be was tabulated as the number
of times it occurred. These two linguistic phenomena became dependent
variables. Two independent variables were examined: (1) extent of black contact:
extensive (n = 4), medium (n = 6), limited (n = 8), and none (n = 10), (2)
reference group: black (n = 5), white (n = 23). One-way ANOVAs were run for
the linguistic variables to see if differences among the $\bar{X}$s of the four levels of
"extent of contact" groupings were significant. The following two tables are
presented:

Negative Concord and Extent of Black Contact

Source      SS          df    MS         F
Between     11836.97     3    3945.66    3.32*
Within      28542.04    24    1189.25

*p < .05






Distributive Be and Extent of Black Contact

Source      SS        df    MS       F
Between     14.803     3    7.321    25.54*
Within       7.162    24     .298

*p < .01

Based on the statistical analyses and an examination of data from individual Ss,
the author concluded that extent of contact is a necessary but not sufficient
condition for acquisition of these two linguistic forms. Which strength of association
test should be used (ω² or η²)? What is the strength of association? What
statistical procedure could be used to check for the effect of the second independent
variable, reference group, on these two linguistic forms? What strength of
association test would you use in that case?


Chapter 12

Repeated-Measures
Comparison of Three or More
Groups

• Parametric comparisons: Repeated-measures one-way ANOVA

Interpreting the F ratio
eta² for repeated-measures

• Nonparametric comparisons: Friedman test

Strength of association: eta²
Nemenyi's test

In the previous chapter we compared the performance of three or more groups
when the data came from different data sources. That is, data from different Ss
appeared in each group. The comparison was between groups. In this chapter
we will compare three or more groups when the data are taken from the same
data source. That is, data are taken from the same Ss at different points in time
or on a set of different tasks at one time. The data are not independent.

As in the previous chapter, we will begin with the parametric procedure and then
move to the nonparametric equivalent.


Parametric Comparisons: Repeated-Measures One-way ANOVA 

We can compare the X̄s drawn from the same Ss on several different measures
(or the same measure at several different times) using a modified ANOVA
procedure. The data must meet the assumptions of ANOVA (just as in the last
chapter). These assumptions include:

1. There is one dependent variable and one independent variable with three or
more levels.

2. The data are from the same Ss (repeated-measures).

3. The data have been measured as ordinal scales or interval scores (continuous
measurement). The distribution of data in the ordinal scales is appropriately
captured by X̄ and variance.

4. Scores in each sample are distributed such that X̄ and variance are
appropriate measures to use in the analysis.




5. The data in the population from which the samples were drawn are normally
distributed with equal variances.

6. The design is balanced (if calculations are done by hand).

7. There is a minimum of five observations per cell (and more is better).

8. The F statistic allows us to reject or accept the null hypothesis. If there is a
statistically significant difference in the samples, it does not pinpoint precisely
where that difference lies. (We will deal with this later.)

The regular ANOVA procedure will have to be modified. This is because X̄s
taken from the same Ss will resemble each other more than those taken from
different Ss.

Let's use an example and work through the amended procedure. Irujo (1986) has
written a report on the acquisition of idioms. Imagine that we, too, plan to carry
out research on idioms in a second language. As part of this research, we first
wish to replicate Irujo's study to see whether Ss would perform differently with
three types of idioms. The idioms Irujo selected were those identical to idioms in
Spanish, similar to those in Spanish, or completely different from Spanish idioms.
One of the measures used was a multiple-choice task where native speakers of
Spanish selected the appropriate paraphrase of an idiom. Here are the X̄ and s
for each of these three subgroups of idiom type in Irujo's data:


Idiom Type            Multiple-Choice

Identical    Mean     14.58
             SD        0.79
Similar      Mean     14.67
             SD        0.65
Different    Mean     12.25
             SD        2.01

F(2, 22) = 15.45 (p < .001)


Note that the numerical difference among the means does not appear to be great.
Remember, though, that we expect the same Ss to perform more similarly on
different tasks than would different Ss. Also note the differences in spread of
scores (SD is used here as the abbreviation instead of s) for the three idiom types.
This difference in variability from the X̄ for the three measures should seem
reasonable. Less variability would be expected for identical and similar idioms; the
problem arises when idioms are different. Notice the two numbers in parentheses
following F. These indicate the number of degrees of freedom. The first number
is based on the number of levels of the independent variable (levels minus 1).
For regular one-way ANOVA, the levels would be the groups; for Repeated-measures
ANOVA they are the tasks or times data were collected. Since here each S had a
score for three different types of idioms, there were three tasks and, therefore, 2
df. Don't worry now about how the second number was obtained. (If you can
guess, that's fine.) The probability reported for the F value tells us that we can
have great confidence that there is a statistical difference somewhere in the data
(though we still have not located precisely which pairs of X̄ values differ from
each other). If you had read the abstract of this article and then interpreted the
table, you would now be ready to read the total article, checking your predictions
against the author's interpretations.

We don't have the actual data so we cannot rerun the analysis using a one-way
ANOVA with repeated-measures. However, let's use some fictitious data. Let's
imagine that we carried out a replication study using the second of Irujo's tasks.
Twelve randomly selected Ss (the same n size as in the original study) were asked
to give a definition for idioms which were classified in the same way (as identical,
similar, or different from those in the L1). Each student received a score for the
number of correct definitions.

As always, we begin by stating the H₀.

There is no difference in accuracy of definitions given by Ss to three
types of idioms (which vary in similarity to idioms in the L1).

The study is a replication study (using the same materials as those in the original
study and with Ss from the same L1 group); we will select a .05 level of
significance for rejecting the H₀. While the study replicates Irujo's original research in
most ways, the proficiency level of the Ss is different. Irujo's Ss were "advanced"
learners and our Ss are randomly selected low-intermediate Ss. This is a
replication study, but one thing has been changed. Would you argue for a two-tailed
test of significance rather than a one-tailed test? This is a choice the researcher
needs to justify; there are no strict guidelines that bind you to one or the other.
We would argue for (but not insist on) a two-tailed hypothesis on the basis that
the Ss are quite different and we would need a two-tailed .05 (.025 from each tail)
to feel confident that we do not err in rejecting or accepting the null hypothesis.

We would like a large sample size and balanced design to be sure we don't violate
the assumptions of normal distribution and equal variances. However, since the
test is fairly robust when the design is balanced, it will certainly make our
computations easier if we use a small n to illustrate the procedure.

Whatever our findings, we want to expand the population to which Irujo's
description might apply. The characteristics of the population covered by the
original study and this replication are alike in terms of L1 and n size. Proficiency
level differs. Obviously we still will not be able to generalize the findings beyond
the population from which Ss in the two studies were drawn. That is, we cannot
make claims about L1 and L2 idioms for other ESL groups. Further replications
with other L1 and L2 groups would enlarge the scope of generalizability.

Here are the data:

S         Identical   Similar   Different   Total      n
1            13         10          8         31       3
2            12         10          6         28       3
3            14         11          5         30       3
4            12          9          6         27       3
5            14          8          8         30       3
6            12          8          3         23       3
7            14         11          4         29       3
8            12         10          6         28       3
9             9          9          3         21       3
10           13         10          7         30       3
11           14          8          5         27       3
12           12         10          9         31       3

Totals      151        114         70       G = 335
           (12)       (12)       (12)      (N = 36)


The first step in carrying out the procedure is to arrange the information above
in a way that will give you the information that you will need for ANOVA. After
the three data columns, there is a column marked "Total"; this is the total score
for each S. So, for example, S1 had scores of 13 + 10 + 8 for a total score of
31.

Second, we added all the scores in the column labeled "Identical" (13 + 12 +
14 + 12, etc.) and placed the total at the bottom of the column. Let's label this
column total $T_{A1}$. The subscript "A" shows that this is a total for a group
(remember "A" and "K" are often used to represent groups with "A" more commonly
used for within-group levels), in this case the first group, "Identical Idiom." We
can label the totals of the other two columns as $T_{A2}$ and $T_{A3}$.

To help keep the n figures straight, we put a number 3 next to the individual total
scores (the total at the far right) for each S. This is $n_S$. We could also place a
number 12 in parentheses at the bottom of each column to show the total number
of scores in this direction too. This is $n_A$.

This starts us with step 1 in the ANOVA process.

Step 1

Find $T_{A1}$, $T_{A2}$, $T_{A3}$.

We have already done this and entered the totals at the bottom of the columns.
$T_{A1} = 151$, $T_{A2} = 114$, $T_{A3} = 70$. The n for number of observations in each sample
is 12.

Step 2

Find $T_{S1}$, $T_{S2}$, $T_{S3}$, ..., $T_{Sn}$.

We have already done this too. By adding the scores of the first S on the three
measures, we have $T_{S1}$. The little subscript S shows the totals are for Ss rather
than for types of idioms.

Find the n for number of scores. Each S has three scores, so the $n_S$ is 3.

Step 3

Find G. Sum all the scores. You can do this most easily by totaling
$T_{A1} + T_{A2} + T_{A3}$. G = 335.

At this point you could "check" if you like to be sure G is correct by adding each
S's total score (in the far right column) and see if the two figures for G are
identical. Fortunately, they are. If they were not, this would be the place to stop and
recheck the calculations.

Find N. Count the total number of scores. This number is N. We have recorded
it in parentheses next to the figure for G on the data chart. N = 36.


Step 4

Find $G^2 \div N$.

To do this step, we square G and divide this value by N. $G^2 = 112225$ and
$G^2 \div N = 3117.36$.

Step 5

Find $\sum X^2$.

To do this, we square each raw score and sum these squared values. For these
data, $\sum X^2 = 3469$.

Step 6

Find $SS_T$.

$$SS_T = \sum X^2 - \frac{G^2}{N}$$

This is easy since it asks us to subtract the results of step 4 from step 5. The
answer is 351.64. This number must be positive. $G^2 \div N$ must be smaller than
or equal to $\sum X^2$. This is a check to make sure everything is okay so far. If the
number is negative, you should stop and find your error.

Step 7

Find $\left[\sum_{A=1}^{a} T_A^2\right] \div n_A$.

To do this, we first square each of the column totals and then sum these three
squared values.

$$T_{A1}^2 + T_{A2}^2 + T_{A3}^2 = 40697$$

And, now, divide this by the number of scores in each column (i.e., 12). The
answer should be 3391.42.

$$\frac{40697}{12} = 3391.42$$

Step 8

Find $SS_A$. This is the sum of squares between the groups (SSB).

$$SS_A = \left[\sum_{A=1}^{a} T_A^2 \div n_A\right] - \frac{G^2}{N}$$

To do this subtract the result of step 4 from that of step 7.

$$SS_A = 3391.42 - 3117.36 = 274.06$$

The $SS_A$ = 274.06.

Step 9

Find $\left[\sum_{S=1}^{s} T_S^2\right] \div n_S$.

To do this, we first square each of the total scores for each individual subject and
sum them.

$$31^2 + 28^2 + 30^2 + \cdots + 31^2 = 9459$$

Then we divide by the number of scores for each S, which in this case is 3.

$$9459 \div 3 = 3153$$

Step 10

Find $SS_S$. This is the sum of squares within (SSW).

$$SS_S = \left[\sum_{S=1}^{s} T_S^2 \div n_S\right] - \frac{G^2}{N}$$

To find $SS_S$, subtract the figure found in step 4 from the figure found in step 9.

$$3153 - 3117.36 = 35.64$$

Step 11

Find $SS_{AS} = SS_{Total} - SS_A - SS_S$. This is the sum of squares for the interaction
of types of idioms and subjects ($SS_{BW}$).

Subtract the resulting figures: step 6 minus step 8 minus step 10.

$$351.64 - 274.06 - 35.64 = 41.94$$


Now let's stop and see how these computations fit into an ANOVA table.
Here is the ANOVA table for our study.

One-way ANOVA with Repeated-Measures

Source                      df    SS        MS        F
Idiom Type (A)               2    274.06    137.03    71.74
Subjects (S)                11     35.64
Types × Subjects (A × S)    22     41.94      1.91
Total                       35    351.64
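
The sums of squares that feed this table (steps 1 through 11) can be checked with a short script. The following is only a minimal sketch in Python, not part of the original procedure: the function name `repeated_measures_ss` and the dictionary keys are our own labels, and the scores are the fictitious replication data given earlier.

```python
# Fictitious idiom-definition scores from the replication example.
# One row per S; columns are Identical, Similar, Different.
scores = [
    (13, 10, 8), (12, 10, 6), (14, 11, 5), (12, 9, 6),
    (14, 8, 8), (12, 8, 3), (14, 11, 4), (12, 10, 6),
    (9, 9, 3), (13, 10, 7), (14, 8, 5), (12, 10, 9),
]


def repeated_measures_ss(rows):
    """Return the sums of squares for a one-way repeated-measures ANOVA."""
    n_s = len(rows[0])                                  # scores per S
    n_a = len(rows)                                     # scores per level
    col_totals = [sum(col) for col in zip(*rows)]       # step 1: T_A
    row_totals = [sum(row) for row in rows]             # step 2: T_S
    g = sum(col_totals)                                 # step 3: G
    n = n_s * n_a                                       # step 3: N
    correction = g ** 2 / n                             # step 4: G^2 / N
    sum_x2 = sum(x ** 2 for row in rows for x in row)   # step 5: sum of X^2
    ss_total = sum_x2 - correction                      # step 6
    ss_a = sum(t ** 2 for t in col_totals) / n_a - correction   # steps 7-8
    ss_s = sum(t ** 2 for t in row_totals) / n_s - correction   # steps 9-10
    ss_as = ss_total - ss_a - ss_s                      # step 11
    return {"G": g, "N": n, "SST": ss_total,
            "SSA": ss_a, "SSS": ss_s, "SSAS": ss_as}


ss = repeated_measures_ss(scores)
for name, value in ss.items():
    print(name, round(value, 2))
```

Rounded to two decimals, the four sums of squares agree with the SS column of the table.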




The first column in the ANOVA table identifies the variables: idiom type,
subjects, and the interaction. The interaction is read as a "types by subject"
interaction, with × used as the symbol for "by."

The next column gives the degrees of freedom. The df for idiom type is easy.
There were three types, so 3 − 1 = 2 df. There were 12 Ss, so the df for Ss is
12 − 1 = 11. Next, you will find the df for the interaction. The df for interaction
are (a − 1) × (s − 1). The symbol a is for the number of levels of the independent
variable idiom type. There were three types, so a − 1 = 2. The symbol s is for the
subject variable. There were 12 Ss, so s − 1 = 11. The total df for the
interaction, therefore, is 2 × 11 = 22. (If you had problems understanding the (2, 22)
figures in the Irujo study, the source of both these figures should now be clear.)

The figures that go in the SS column are the answers to the computation steps
we've just completed. Step 8 gave us the sum of squares for types, $SS_A$. Step 10
gave us the sum of squares for subjects, $SS_S$. Step 11 is $SS_{AS}$. We have inserted
each of these values into the above table. The $SS_T$ value found in step 6 goes in
the total row of the SS column.

The MS column was completed as follows. The df for type of idiom is 2, so we
can divide the SS for type of idiom by 2. The answer is 137.03. The MS value
for type by Ss can be obtained by dividing the $SS_{AS}$ by 22. The answer is 1.91.

You will notice that we have not computed the MS for subjects. In ordinary
ANOVA we expect Ss to differ from each other and so the MSW figure is used
to determine the F ratio. In Repeated-measures ANOVA this is not the case.
We are not interested in how Ss differ from each other, but in how each S differs
on the repeated-measures. So, the F ratio is found by dividing MSB (here $MS_A$)
by $MS_{BW}$ (here $MS_{AS}$).

The final step in computing the F ratio is step 12.

Step 12

To find the F ratio, we divide: $MS_A \div MS_{AS}$. These figures are in the summary
table. The F ratio is 71.74.

Now, as always, we place this value in a distribution of F ratios with the same
df. Since the F ratio in this one-way Repeated-measures ANOVA is computed
by dividing the MS for idiom type (A) by the MS of the interaction (AS), the two
df we must consult are those for type (2) and for the interaction (22). The F
distribution table is used for both between-groups and within-group designs. The
repeated-measures ANOVA formula yields an F ratio which has the same
distribution properties as that for between-groups designs. In the F distribution table
(table 5, appendix C), find the intersection of 2 and 22 df. If the obtained F ratio
is larger than the critical value given in the table, we can reject the H₀. We must
choose between .05 or .01 levels of probability. Prior to the study, we selected .05
as our critical level, so $F_{crit} = 3.44$. We can reject the H₀ because our $F_{obs}$ is greater
than $F_{crit}$.

After all this work, we would hope that there would be space in the report to
include not only the X̄ and s for each idiom type but the ANOVA table as well.
However, all this information might be summarized in the statement: F = 71.74
(2, 22), p < .05.
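
The df, MS, and F calculations can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `f_ratio` is ours, the SS values are those from the table, and the critical value 3.44 is simply copied from the text rather than computed. Note that dividing by the unrounded $MS_{AS}$ gives an F of about 71.88; the 71.74 reported above comes from dividing by the rounded value 1.91. The decision about the H₀ is the same either way.

```python
def f_ratio(ss_a, ss_as, levels, subjects):
    """Return df, MS, and F for a one-way repeated-measures ANOVA."""
    df_a = levels - 1                       # a - 1
    df_as = (levels - 1) * (subjects - 1)   # (a - 1) x (s - 1)
    ms_a = ss_a / df_a
    ms_as = ss_as / df_as
    return df_a, df_as, ms_a, ms_as, ms_a / ms_as


# SS values taken from the ANOVA table for the idiom replication.
df_a, df_as, ms_a, ms_as, f_obs = f_ratio(274.06, 41.94, levels=3, subjects=12)

F_CRIT_05 = 3.44  # critical F for (2, 22) df at .05, as quoted in the text
reject_h0 = f_obs > F_CRIT_05

print(df_a, df_as, round(ms_a, 2), round(ms_as, 2), round(f_obs, 2), reject_h0)
```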

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 12.1 

1. How will you interpret the findings? Do they support those reported by Irujo?
Can you enlarge the descriptive scope of the original study? And the population
to which the findings might generalize? Why (not)? _


2. A parametric ANOVA was chosen for this study because the data were
scores. Do you believe the data met the other requirements for using ANOVA?
Why (not)? _


3. ANOVA is reputed to be quite "robust" to violations of the assumptions on
which it is based, particularly that of normal distribution and equal variance.
That is, if there is a violation of this assumption and you find you can reject the
H₀ at the .05 level, the actual probability may not be far off. It might be in the
.07 or .09 range. Would it concern you if you rejected the H₀ at the .05 level and
then found there really were 7 or 9 chances (rather than 5) in 100 that you are
wrong in rejecting the H₀? Why (not)? _


4. A study of reading scores of students on five tests reported an F(4, 116) =
1.036 (p = n.s.). Explain the numbers 4, 116. How would you interpret the F
ratio?

> 5. Imagine you teach three classes of 15 students each. They are different
sections of the same class, so Ss are assumed to be at the same level of
proficiency. You want to see which reading lessons have the most beneficial effect on
biweekly reading quizzes. The first week you work on scanning, then skimming,
reading for the main idea, prereading skills, and finally guessing vocabulary from
context. You randomly select 9 Ss from the classes. The test results are listed
below. Do a one-way ANOVA with repeated-measures.

Fictitious Reading Test Scores

Week     2     4     6     8    10
S1      24    29    33    42    37
S2      25    27    36    44    43
S3      28    38    29    47    48
S4      21    26    34    30    39
S5      27    22    31    40    35
S6      26    27    45    36    46
S7      10    18    20    21    31
S8      34    36    39    50    41
S9      49    42    53    52    64


Show your calculations below. Can you reject the H₀? How would you interpret
the results?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Interpreting the F Ratio

The F ratio allows us to reject or accept the null hypothesis. If the F ratio is
statistically significant we know that the means that have been compared are
significantly different. However, it does not allow us to locate the difference
precisely. If a researcher compares group performance at five different time
periods and obtains a statistically significant difference, it is possible that the
difference might be between time 1 + 2 vs. 3 + 4 + 5, between time 1 + 2 + 3
vs. time 4 vs. time 5, or many other possible combinations. In order to pinpoint
the location of the difference, the researcher can use a multiple-range test. Again,
the Scheffe, Tukey, and Newman-Keuls tests are commonly used for this purpose. We ask
that you consult a computer manual for more details on the actual statistics for
these tests. However, for reading practice, we have included a SAS printout
showing a multiple-range comparison for the data in practice 12.1.5.

Scheffe Test for Comparison of Means
Minimum significant difference = 6.86.

Means with the same letter are not significantly different.

Scheffe Grouping    Mean      N    Week
      A             42.667    9    5
  B   A             40.222    9    4
  B   C             35.556    9    3
  D   C             29.444    9    2
  D                 27.111    9    1


The table shows that performance at week 5 is better than for weeks 1, 2, and 3 
(but not week 4). Performance at week 4 differs from that at weeks 1 and 2 (but 
not weeks 3 and 5). There is no difference between weeks 3 and 2, but week 3 is 
higher than week 1. There is no difference between weeks 1 and 2. 
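
The reading of the printout can be checked by comparing every pair of means against the minimum significant difference. The sketch below is an illustration in Python, not the SAS procedure itself: it takes the five week means and the value 6.86 straight from the printout, and the function name `significant_pairs` is our own.

```python
from itertools import combinations

# Week means and minimum significant difference, copied from the printout.
week_means = {1: 27.111, 2: 29.444, 3: 35.556, 4: 40.222, 5: 42.667}
MIN_SIG_DIFF = 6.86


def significant_pairs(means, msd):
    """Return the pairs of levels whose means differ by more than msd."""
    return [(a, b) for a, b in combinations(sorted(means), 2)
            if abs(means[a] - means[b]) > msd]


pairs = significant_pairs(week_means, MIN_SIG_DIFF)
print(pairs)
```

The pairs that come out are weeks 1-3, 1-4, 1-5, 2-4, 2-5, and 3-5, which is the same pattern the letter groupings express.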

The ANOVA showed that student performance varies across the testing periods
and the Scheffe shows that the difference is statistically significant only between
certain time periods. The differences, however, are not distinct between adjoining
weeks. Rather, there is an overlap, so that performance levels at weeks 1 and 2
are not different, weeks 2 and 3 are not different, weeks 3 and 4 are not different,
and 4 and 5 are not different. This makes it difficult to make a connection
between the specific skill taught and test performance. Even if there were
significant differences week by week, we could not claim a causal relationship between
specific skill instruction and test score improvement. If we wanted to check to see
if weeks 4 + 5 differ from weeks 1 + 2 + 3, we could do another Scheffe to
make this comparison. If the result were significant, we might be able to show
that the treatment needs three weeks before it "takes effect." Otherwise, the
scores may increase simply due to more practice time (no matter what the type
of practice). To provide a strong link between instruction and skill improvement,
we would need to redesign the study with appropriate control groups.

In this study, we first found a significant F ratio using ANOVA. We then turned
to the Scheffe to show us where the differences were located. There is, however,
a strange phenomenon that sometimes occurs. The F ratio is statistically
significant, but when other multiple-range tests (e.g., Duncan or Tukey) are run, some
of them do and some do not locate significant differences among the levels. This
occurs because some multiple-range tests are more "conservative" than others.
The Scheffe is notoriously "conservative"; that is, it is more demanding than some
of the other multiple-range procedures in locating statistically significant
differences among means. It's unlikely that it would ever allow you to err in rejecting
the null hypothesis. If you run several multiple-range tests and they do not agree,
please discuss this issue with your advisor or statistical consultant so that your
interpretation is well-informed.


eta² for Repeated-Measures

It is not possible to use the omega² strength of association test with
Repeated-measures ANOVA. Instead, we use a special eta² formula to test the degree of
association between the variables. (This special formula can also be used for
between-groups comparisons with unbalanced designs.) The formula for eta² in
Repeated-measures ANOVA is:

$$\eta^2 = \frac{SS_A}{SS_T}$$

The formula just says to take the sum of squares of the factor you are interested
in and divide by the sum of squares total. For the reading study mentioned in
practice 12.1.5, eta² would be:

$$\eta^2 = \frac{SS_{types}}{SS_T} = \frac{1615.11}{5220} = .309$$

The relationship shows that 31% of the variability in the exam has been
accounted for (in these data) by the five different reading lessons. This is a healthy
relationship (with real data the association would not likely be this high).

If you used omega² as the strength of association measure with these data, you
would obtain a lower association; omega² gives a more conservative estimate than
eta². However, with repeated-measures or unbalanced designs, remember to use
the appropriate test of association: eta².
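
As a quick check on the arithmetic, eta² for the reading study can be computed directly. This is only a sketch in Python; the function name `eta_squared` is ours, and the two sums of squares are the values quoted above for practice 12.1.5.

```python
def eta_squared(ss_factor, ss_total):
    """Proportion of total variability accounted for by the factor."""
    return ss_factor / ss_total


# SS for the five reading lessons and SS total, as quoted in the text.
eta2 = eta_squared(1615.11, 5220)
print(round(eta2, 3))
```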

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 12.2

> 1. For the Irujo replication on page 347, calculate eta² for idiom type.
eta² = _

Interpretation: _


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Nonparametric Comparisons: Friedman Test 

If you cannot meet all the assumptions of ANOVA (whether for descriptive or
inferential purposes), then you would be better served by a nonparametric test
such as the Friedman. The Kruskal-Wallis test parallels one-way ANOVA where
the comparisons are between groups. A parallel nonparametric test for a
Repeated-measures ANOVA is the Friedman test.

There are some cautions to remember when using the Friedman test. With three
levels, you need at least 10 observations per group. With four levels, you need
at least 5 per group. Let's work through the formula, now, with an example that
meets this requirement.

One of the objectives in an advanced ESL composition class was that of
persuasive discourse. The ESL teacher decided to use a current topic which concerns
both teachers and students: whether separate, special composition classes for
ESL students should be offered for this advanced level. So, the topic for the
persuasive discourse unit was whether nonnative speakers of English should take
this advanced course with native speakers in a regular English class or whether
it should be offered as an ESL course. The students first discussed the issue and
then voted to show their opinion of taking the course in the regular English
program. The vote is on a 7-point scale. Second, the students read a short article
in the campus newspaper which argued that nonnative speakers would be better
served taking a course especially geared to their needs. A second vote was taken
at this point. Next, a foreign student graduate working at IBM came to the class
and argued forcefully for taking the course with native speakers. A third vote
was taken. Finally, the students listened as the teachers of the two courses went
over the course syllabi for the regular English class and the ESL class. A final
vote was taken. The research question is whether the votes remained constant
over time (impervious to persuasion) or changed according to the arguments
presented at each time period.


Research hypothesis? There is no difference in votes due to type of 
persuasion (e.g., vote 1 = 2 = 3 = 4). 

Significance level? .05 
1- or 2-tailed? 2-tailed 
Design 

Dependent variable(s)? Votes 

Measurement? 7-point scale 

Independent variable(s)? Persuasive argument 

Measurement? Nominal--4 levels (discussion, news article, guest speaker, syllabi)

Independent or repeated-measures? Repeated-measures 
Other features? Intact group 
Statistical procedure? Friedman 


The data are displayed in the table below. The first column identifies the S. At
time 1, S1 gave a vote of 5 on the 7-point scale; at time 2, S1 gave a vote of 2;
at time 3 a 4; and at time 4, a vote of 1. This S's lowest vote was a 1 and so it is
assigned a rank of 1; the next lowest was the vote at time 2, so this is ranked 2.
The next rank is for time 3 with a vote of 4, and the highest rank is for time 1
with a vote of 5. Thus, there is a rank order within each S's votes. The ranks
are across each row, not down each column.

Scores Across Four Time Periods

        Time 1          Time 2          Time 3          Time 4
S     Score  Rank     Score  Rank     Score  Rank     Score  Rank
1       5     4         2     2         4     3         1     1
2       6     4         4     2         5     3         2     1
3       7     4         3     2.5       3     2.5       2     1
4       7     4         4     2         5     3         3     1
5       6     3.5       5     2         6     3.5       2     1
6       4     1         5     2.5       5     2.5       6     4
7       5     3         4     1.5       6     4         4     1.5
8       7     3.5       6     2         7     3.5       3     1
9       6     3.5       5     2         6     3.5       3     1

      T1 = 30.5       T2 = 18.5       T3 = 28.5       T4 = 12.5


The Friedman formula doesn't start with "Friedman equals." Rather it starts
with "Chi-square of ranks" ($\chi^2_R$), perhaps because the distribution table to be used
is the Chi-square table. We will present the Friedman formula as a series of
steps. The numbers 12 and 3 are, again, constants in the formula.

$$\chi^2_R = \frac{12}{(a)(s)(a+1)} \sum_{A=1}^{a} T_A^2 - 3(s)(a+1)$$

Step 1

Sum the ranks for each time. Time 1 = 30.5, time 2 = 18.5, time 3 = 28.5, and
time 4 = 12.5.

Step 2

Square each sum of ranks and total these.

$$\sum_{A=1}^{a} T_A^2 = 30.5^2 + 18.5^2 + 28.5^2 + 12.5^2 = 930.25 + 342.25 + 812.25 + 156.25 = 2241$$


Step 3

Divide 12 by the number of levels (a), times number of observations per level
(s), times levels plus 1 (a + 1).

$$\frac{12}{(a)(s)(a+1)} = \frac{12}{4 \times 9 \times 5} = \frac{12}{180}$$

Step 4

Multiply the results of steps 2 and 3.

$$\frac{12}{(a)(s)(a+1)} \sum_{A=1}^{a} T_A^2 = .066 \times 2241 = 147.91$$

Step 5

Compute 3 times s, the number of observations per level, times a + 1, the
number of levels plus 1.

$$3(s)(a+1) = 3(9)(5) = 135$$

Step 6

Subtract the result of step 5 from that of step 4.

$$\chi^2_R = 147.91 - 135 = 12.91$$

Now that we have the value of $\chi^2_R$, we can check the critical value needed for the
appropriate degrees of freedom in the Chi-square distribution table (table 7 in
appendix C). This time, the df is the number of levels minus 1. In this study the
number of levels was 4, so the degrees of freedom for the study is 4 − 1 = 3.
Since we have selected an α of .05, and our df is 3, we look at the intersection of
column 2 and row 3 to find the critical value.

$\chi^2$ critical for 3 df is 7.81. We can reject the H₀ and conclude that Ss' ratings did
change over time as the persuasive discourse changed. We can show that the
scaled votes vary from time to time and we can show, numerically, how they
differed. However, we cannot say which kinds of persuasive discourse were most
effective in influencing votes on the basis of the statistical procedure. The
procedure tells us that we can feel confident about saying that the scores do, indeed,
differ, and we show that confidence by displaying a simple statement: Friedman
$\chi^2_R$ = 12.91, 3 df, p < .05.
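
The six Friedman steps can be sketched in Python as a check on the hand calculation. This is an illustration under our own assumptions, not a published routine: the function names `rank_row` and `friedman_chi_square` are ours, tied votes receive the average of the ranks they span (as in the table), and no correction for ties is applied. Note that carrying 12/180 at full precision (.0667 rather than .066) gives a value of 14.4 rather than 12.91; both exceed the critical value of 7.81, so the decision about the H₀ does not change.

```python
# Votes on the 7-point scale: one row per S, columns are times 1 to 4.
votes = [
    (5, 2, 4, 1), (6, 4, 5, 2), (7, 3, 3, 2), (7, 4, 5, 3), (6, 5, 6, 2),
    (4, 5, 5, 6), (5, 4, 6, 4), (7, 6, 7, 3), (6, 5, 6, 3),
]


def rank_row(row):
    """Rank one S's scores from low to high, averaging tied ranks."""
    ranks = []
    for x in row:
        lower = sum(1 for y in row if y < x)
        tied = sum(1 for y in row if y == x)
        ranks.append(lower + (tied + 1) / 2)
    return ranks


def friedman_chi_square(rows):
    """Return the rank sums per level and the Friedman chi-square of ranks."""
    s = len(rows)                                      # observations per level
    a = len(rows[0])                                   # number of levels
    ranked = [rank_row(row) for row in rows]
    rank_sums = [sum(col) for col in zip(*ranked)]     # step 1
    sum_sq = sum(t ** 2 for t in rank_sums)            # step 2
    chi_r = 12 / (a * s * (a + 1)) * sum_sq - 3 * s * (a + 1)  # steps 3-6
    return rank_sums, chi_r


rank_sums, chi_r = friedman_chi_square(votes)
CHI_CRIT_05 = 7.81  # critical chi-square for 3 df at .05, as quoted in the text
print(rank_sums, round(chi_r, 2), chi_r > CHI_CRIT_05)
```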


Strength of Association: eta²

When the outcome of the Friedman allows us to reject the null hypothesis, we can
then turn to a test of association to discover how strong a relationship exists. The
strength of association formula for the Friedman is eta²:

$$\eta^2 = \frac{\chi^2_R}{N^* - 1}$$

In the above formula, $\chi^2_R$ is the statistic you calculated for the Friedman. The
$N^*$ is the number of Ss times the number of observations on each S. The
calculations are simple. Let's apply it to the example above:

$$\eta^2 = \frac{\chi^2_R}{N^* - 1} = \frac{12.91}{(9 \times 4) - 1} = .37$$

While the Friedman test told us that votes did change over time as the persuasive
discourse changed, eta² tells us that the strength of association between
persuasive arguments and voting behavior for these data was really quite strong (we
should be pleased). We've made an excellent start in understanding the
relationship between persuasion and voting behavior on this issue. However,
persuasion alone doesn't cover all the variability in the data. "Error" (normal
variability in performance and that due to other variables yet to be researched)
is still large. That should suggest redesigning the study to include other variables
to see if we can enlarge our view as to what influences attitudes (expressed as
votes) on the issue of separating or integrating native speakers and nonnative
speakers in composition courses.
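
The eta² arithmetic for the vote example can be checked with a few lines of Python. This is only a sketch: the function name and its parameters are our own labels, not part of any statistics package, and it assumes the formula given in this section (the Friedman statistic divided by N* - 1).

```python
def friedman_eta_squared(chi_square_r, n_subjects, n_observations):
    """Strength of association for the Friedman test: chi2_R / (N* - 1)."""
    # N* is the number of Ss times the number of observations on each S.
    n_star = n_subjects * n_observations
    return chi_square_r / (n_star - 1)


# Vote example: chi2_R = 12.91, 9 Ss, 4 observations on each S.
eta_squared = friedman_eta_squared(12.91, 9, 4)
print(round(eta_squared, 2))
```

Rounded to two places, the result should agree with the hand calculation.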


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 12.3

► 1. Following this exercise, the teacher asked the students to write an explanatory
essay on why different types of evidence were/weren't effective as persuasion.
How might this become a new research project? What is a candidate research
question? _


► 2. For the problem below, please compute the value of $\chi^2_R$. We will give you
the answer based on a computer run of the same data. Carry out the calculations
and see whether the computer gives you exactly the same values.

People often say that immediate feedback is valuable for correcting errors. We
wondered if perhaps such feedback might simply make people more anxious (and
thus lead to more error). To test this, we gave three different computer
assignments, each of which had two parts. For the first assignment, we sat beside
students and each time they were about to make an error, we stopped the Ss and
explained what was wrong. In the second assignment, we sat beside students and
helped to correct errors after they occurred. In the third assignment, we let the
students get error messages from the computer after they ran the program. Each
of these assignments had a second part where the Ss did a similar problem on
their own. There were 10 possible errors in each exercise. A perfect score would
be 0 and a score of 10 would show that every possible error occurred. These
scores became the data for the study.

Research hypothesis? _

Significance level? _ 

1- or 2-tailed? _ 

Design 

Dependent variable(s)? _ 

Measurement? _ 

Independent variable(s)? _ 

Measurement? _ 

Independent or repeated-measures? 

Other features? _

Statistical procedure? _ 


The data are from 18 Ss.

S      T1
1      2
2      4
3      3
4      1
5      3
6      6
7      9
8      1
9      3
10     4
11     2
12     2
13     3
14     2
15     5
16     3
17     6
18     2

Totals _

Place your calculations in the space provided below. (Warning: Remember that
this time, the best score is the low score. The worst possible score is a 10. If there
were someone with a score of 10, they would receive the rank of 1.)

Here is the computer output (using the SPSS-X program).

Friedman Test

TS    TEACHER STOPS
TC    TEACHER CORRECTS
DC    DELAYED CORRECTION

               TS      TC      DC
MEAN RANKS     1.80    1.64    2.56

CASES = 18     χ² = 8.583     D.F. = 2     p = .014


Did your calculation agree with the computer printout? If not, where do they
differ?


Interpret these findings.


► 3. Calculate eta² and add this information to your interpretation of the
findings. _


ooooooooooooooooooooooooooooooooooooo

Nemenyi's Test

We have noted that there are a variety of tests (e.g., Scheffe, Tukey) that allow
us to make post hoc comparisons of means following ANOVA. The Nemenyi's
test allows us to accomplish the same type of post hoc comparisons following
Friedman. If the results of the Friedman allow us to reject the $H_0$, we may want
to use Nemenyi's test to show us precisely where the differences are. We can
outline the Nemenyi procedure in seven steps.

Step 1

First, be sure the overall Friedman $\chi^2_R$ statistic was significant. Determine the
critical value of $\chi^2$ for your selected $\alpha$ level. For the example on page 356
regarding change of votes in response to type of persuasion, the $\chi^2_R$ = 12.91 and the
critical value is 7.81.


Step 2

Find $a(a + 1) \div 6n$

The symbol a is the number of measures in the study and n is the number of
Ss. In the vote example, there were four times at which votes were taken, so
a = 4 and a + 1 = 5. There were 9 Ss, so n = 9.

$\dfrac{a(a + 1)}{6n} = \dfrac{4(4 + 1)}{6(9)} = \dfrac{20}{54} = .370$


Step 3

Multiply the result of step 1 by that of step 2. Find the square root of that value.

$\sqrt{(7.81)(.370)} = 1.70$

This is the critical value for the data for Nemenyi.
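
Steps 1 through 3 can be combined into one small function. The sketch below is illustrative only: the function and parameter names are our own, and the critical Chi-square value still has to be looked up in the table and supplied by hand.

```python
import math


def nemenyi_critical_difference(chi_square_critical, n_measures, n_subjects):
    """Critical difference between mean sums of ranks (steps 1 to 3)."""
    # Step 2: a(a + 1) / 6n
    multiplier = n_measures * (n_measures + 1) / (6 * n_subjects)
    # Step 3: square root of (critical chi-square x result of step 2)
    return math.sqrt(chi_square_critical * multiplier)


# Vote example: critical value 7.81, a = 4 measures, n = 9 Ss.
critical = nemenyi_critical_difference(7.81, 4, 9)
print(round(critical, 2))
```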

Step 4

Compute the mean sum of ranks for each group. That is, divide the sum of ranks
by n, the number of Ss in each measure.

$\bar{X}_{T1} = 30.5 \div 9 = 3.39$
$\bar{X}_{T2} = 18.5 \div 9 = 2.05$
$\bar{X}_{T3} = 28.5 \div 9 = 3.17$
$\bar{X}_{T4} = 12.5 \div 9 = 1.39$


Step 5

Make a table with the mean sum of ranks arranged in ascending order across the
top and down the sides.

                 Mean T4   Mean T2   Mean T3   Mean T1
                 1.39      2.05      3.17      3.39
Mean T4  1.39
Mean T2  2.05
Mean T3  3.17
Mean T1  3.39

To fill in the table, list the differences in mean sum of ranks in the table that you
have set up. For the above example, the differences would be:

3.39 - 3.17 = .22
3.39 - 2.05 = 1.34
3.17 - 2.05 = 1.12
3.39 - 1.39 = 2.00
2.05 - 1.39 = .66
3.17 - 1.39 = 1.78

                 Mean T4   Mean T2   Mean T3   Mean T1
                 1.39      2.05      3.17      3.39
Mean T4  1.39              .66       1.78*     2.00*
Mean T2  2.05                        1.12      1.34
Mean T3  3.17                                  .22
Mean T1  3.39

* = > 1.70


Step 6

Any difference between mean sums of ranks that exceeds the critical value (step 3) is a
significant difference. 1.70 is the critical value.

T3 - T4 = 1.78 > 1.70
T1 - T4 = 2.00 > 1.70

Step 7

Conclude that Ss voted significantly higher after the guest speaker than after the
syllabi presentation; they also voted significantly higher after the class discussion
than after the syllabi presentation. Thus, the Friedman test told us that we could
reject the $H_0$ and conclude that there was a difference in how different types of
persuasion affected voting outcomes. The Nemenyi test has allowed us to
pinpoint more precisely where these differences occurred.
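
The pairwise comparisons of steps 5 and 6 are easy to mechanize. The following is a minimal sketch, assuming the mean sums of ranks and the critical value from the vote example; the dictionary keys and the function name are our own labels.

```python
from itertools import combinations

# Mean sums of ranks from step 4 and the critical value from step 3.
mean_ranks = {"T1": 3.39, "T2": 2.05, "T3": 3.17, "T4": 1.39}
CRITICAL_DIFFERENCE = 1.70


def significant_pairs(means, critical):
    """Return the pairs whose absolute difference exceeds the critical value."""
    flagged = []
    for (name_a, mean_a), (name_b, mean_b) in combinations(means.items(), 2):
        difference = abs(mean_a - mean_b)
        if difference > critical:
            flagged.append((name_a, name_b, round(difference, 2)))
    return flagged


for pair in significant_pairs(mean_ranks, CRITICAL_DIFFERENCE):
    print(pair)
```

Only the comparisons involving T4 with T1 and with T3 should be flagged, which matches the starred cells in the table.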

ooooooooooooooooooooooooooooooooooooo 


Practice 12.4 

► 1. For the Friedman problem in practice 12.3, perform a Nemenyi analysis.
Interpret your findings. _

ooooooooooooooooooooooooooooooooooooo


To summarize, there are several nonparametric tests that can be used to compare
the performance of different groups or of the same group at different times. Two
tests have been presented here. The Kruskal-Wallis parallels a between-groups
one-way ANOVA; the Friedman can be used as the parallel of a one-way
Repeated-measures ANOVA. These tests are especially valuable given the
following conditions.

1. You need a test that does not depend on the shape of the population
distribution.

2. You need a test that allows you to work with small sample sizes while not
knowing exactly the nature of the population distribution.

3. You need a test that has samples made up of observations from several
different populations. For example, you might say that Japanese, Egyptian, and
Venezuelan Ss are all from one population, L2 learners of English. If you
add native speakers of English to the comparison, it would be difficult to say
they are from the same population. No parametric test can handle such data
without requiring us to make unrealistic assumptions.

4. You need a test that can treat data which are in ranks as well as data whose
seemingly interval scores have only the strength of ranks.

The tests presented here have the same major disadvantage as reported for tests
that parallel the t-test. That is, they are wasteful of data. Information is lost,
and the tests are not as powerful as the parametric alternatives.


Activities 

1. R. W. Gibbs & R. A. G. Mueller (1988. Conversational sequences and
preferences for indirect speech acts. Discourse Processes, 11, 1, 101-116.) had 40 Ss
read different conversational sequences involving indirect requests. They
examined two situations: "service encounters," where the addressee's main job is to fill
requests (such as a store clerk), and "detour situations," in which the addressee's
activities or plans are interrupted by the speaker. The authors asked Ss to show
their preference for sequence and expected the Ss would show sequence
preferences similar to those suggested by Levinson (1983). These are arranged below
so that the most preferred sequence (according to Levinson) is at the top and the
least preferred at the bottom.

Example Stimuli: Service Encounter

(A) prerequest                 Do you have D-size batteries?
    response to prerequest     Yes, here you go.

(B) prerequest                 Do you have D-size batteries?
    offer                      Would you like some?
    acceptance of offer        Yes, I'd like two.

(C) prerequest                 Do you have D-size batteries?
    go ahead                   Yes.
    request                    I'll take two.
    compliance                 Here you go.


Each S read hypothetical situations of both service encounters and detour
situations with stimuli like those above. For each situation, they selected the most
appropriate sequence (A, B, or C). The stimuli for each situation differed but the
sequence choices (A, B, or C) were the same.

Results showed the following mean rankings (from a Friedman test):

Service encounter:    B    1.65    $\chi^2_R$ = 24.69, p < .001
                      C    1.84
                      A    2.50

Detour situation:     B    1.74    $\chi^2_R$ = 36.26, p < .001
                      C    1.90
                      A    2.36


Explain the above table. (The general conclusion drawn was that conversational
organization has a strong influence on people's language behavior in making
indirect requests.)

The activities below are a review of tests that compare two or three (or more)
groups. Look at each of the following examples and decide which statistical test
you would use. Since there is no one correct answer for each item, it is important
that you justify your choice.

2. H. Kinjo (1987. Oral refusals of invitations and requests in English and
Japanese. Journal of Asian Culture, 11, 83-106.) wanted to know how native
speakers and second language learners do "refusals." After you read her thesis,
you decided to work out a methodology to look at refusals of requests for a favor.
You waited until the very end of the term when everyone was very busy and then
asked 10 randomly selected native speakers of English (5 male and 5 female) and
10 nonnative speakers (again 5 men and 5 women) for help in entering all your
data on the computer. You told them you only have about 500 scores and you
would like them to read these off to you while you enter them. With your trusty
hidden microphone, you tape-recorded their responses. Hopefully, these all
turned out to be refusals. (And of course you got permission from all 20 Ss to
use the data for the research project.) You asked two experts in intercultural
communication to rank the acceptability of each refusal on a 7-point scale. Each
point was carefully defined and discussed with the raters. You wonder whether
these ratings will be different for native and nonnative speakers (i.e., you were
not going to compare male and female respondents). What statistical procedure
would you use and why?

3. After running the above analysis, you found there was no difference in ratings
that could be attributed to first language of the Ss. You wondered whether your
experts' ratings of appropriateness were influenced by their acceptance of
differences and that this led them to judge everyone about the same. So, you decided
to ask the judges to listen to the refusals once again and judge whether the refusal
was "direct" or "indirect." Fortunately, the judges always agreed in their
judgments so you had confidence in them. You want to know whether language
background makes any difference in whether refusals were judged as direct or as
indirect (you might also wonder whether the sex of the refuser might also make
a difference in the (in)directness of the response, but we haven't learned how to
do this yet!) What statistical procedure would you use and why?

4. In the language lab evaluation study mentioned earlier in this text, you may
have noticed that all the Ss were Vietnamese refugees. Later, the teachers
decided to include Khmer and Lao students in the evaluation. Since some of the
tests require basic literacy skills (and only Vietnamese of these three languages
uses a Roman script), they decided to look to see if the groups were equally
literate in English. A reading test developed at the camp was used to measure
reading skill. It is difficult to get a feel for how "interval" the data might be.
Not being confident about the test, we advised a nonparametric test. Which test
would you suggest and why?

5. Differences in literacy were found among the groups in the above example.
A set of lab materials was then designed that would integrate listening
comprehension and literacy training. Once the materials were ready, the lab staff taught
this as a new ten-week course. At the end of the course, the Lao and Khmer
students were tested on the same reading test. The question was whether there
would be improvement in scores for these two groups of special Ss. Which
statistical procedure would you suggest and why?

6. All entering EFL/ESL students at your university have taken a general
proficiency test. All items related to verb morphology in the test are the data for a
study of how Ss from different L1 backgrounds perform on English tense and
aspect. The verb morphology of some of these L1 groups is rather similar to that
of English; for other L1 groups the morphology is not similar at all. You are
undecided as to whether to group them in this way first and do the analysis or
whether to do a more straightforward comparison across the languages first and
then compare those that are or are not similar to English later. A number of
statistical procedures might be used. Which would you select and why?

7. Another research question for the data in example 6 might be whether the 
verb morphemes form a scale or not. If so, you might form a "natural-order" 
hypothesis. If this is the research question, what statistical test would you use 
and why? 

In your study group, compare your responses to each of these items. Try to reach
a consensus regarding which choices would be best considering the information
given in each justification.


Chapter 13

Comparisons of Means in
Factorial Designs


•Factorial designs
•Calculating ANOVA for a 2 X 2 design
    Interpreting Factorial ANOVA
    Strength of relationship: omega²
•Assumptions underlying Factorial ANOVA
•Selecting Factorial ANOVA programs

Factorial Designs 

In chapter 11, we discussed ways in which we might test the differences in the
performance of three or more groups. In that discussion the groups represented
levels of one independent variable. The research question was always whether
performance on the dependent variable differed across the groups represented in
the levels of only one independent variable.

Factorial designs are those where more than one independent variable is involved
in the design. For example, you might investigate whether Ss taught using one
method outperform those taught by other methods. The dependent variable
might be performance on a language achievement test. The first independent
variable is method, and there might be several methods compared. In the same
study, you might have tried to balance the number of female and male Ss, but
you do wonder whether females and males might perform differently on the
achievement test. Sex, a moderating variable, is the second independent variable.

Here is the box design represented in the study. Two methods are being
compared and two levels for sex make this a 2 X 2 factorial design.

                          Sex
Method                    Male      Female
Teacher Centered
Cooperative Learning

The analysis, then, compares four different groups:

Group 1--Males in teacher-centered class
Group 2--Females in teacher-centered class
Group 3--Males in cooperative-learning class
Group 4--Females in cooperative-learning class

You want to check performance of all Ss in the teacher-centered classroom with
all Ss in the cooperative-learning classroom (groups 1 + 2 vs. groups 3 + 4).
You also want to be sure that females and males perform in similar ways, and so
you will compare all males vs. all females on the test outcome (groups 1 + 3 vs.
groups 2 + 4). However, it is also possible that one methodology may "work
better" for males or for females in the sense that one or the other may benefit
more from one method (group 1 vs. 2 vs. 3 vs. 4).

In this study, we hope that one method "works better" for all Ss and that this
shows up in the "main effect" of method. If we are advocates of cooperative
learning, then we hope that the Ss in the cooperative-learning group will
outperform those in the teacher-centered group. We hope that men and women perform
in the same way and that there is no interaction where, say, males in the
cooperative-learning group do better than females and that females do better
than males in the teacher-centered class. If there were such an interaction, it
would weaken the argument in favor of cooperative-learning techniques. (And,
as a practical matter, it is unlikely that we could apply such results by setting up
separate classes for males and females.)

In One-way ANOVA, we saw that there was only one independent variable.
When we compared SSB and SSW and computed the F ratio, we attributed the
difference between SSB and SSW to the levels of that one independent variable.
In our example, we expect that there will be variability in the performance of Ss
on the achievement test. We want to know what effect the methodology factor
had on that variability. We also want to know what effect the sex factor had on
variability in the data. And, further, we want to know the effect of the
combination of method and sex on variability in test performance.

1. Effect of method (factor A): teacher-centered vs. cooperative-learning 

2. Effect of sex (factor B): male vs. female 

3. Interaction effect (factor A X B) 

The advantage of using a Factorial ANOVA is that we can look not only at the
effect of each independent variable but also the interaction effects in the
combination of different independent variables.

Imagine that all the Ss participating in this study were selected at random from
a population of learners. In addition, they have been randomly assigned to a
French I class taught using teacher-centered procedures or to a French I class
using cooperative-learning techniques. The curriculum objectives for the classes
are the same. The same teacher teaches the two classes, so the design is not
confounded in a technical sense. However, it would be important that the teacher
not know which is the experimental treatment group and not feel that one
technique is any better than the other (an unrealistic hope, of course). Since none of
the Ss had previously studied French, no pretest was given.

At the end of the semester, an achievement test consisting of 25 items was given
and we assume that the scores reflect equal-interval measurement. Fortunately,
the test turned out to be highly reliable even though relatively short. Further,
since random selection and random assignment were used, we believe that the Ss
form samples which are representative of the population of college students
enrolling in our French I classes.

Let's fill in the chart for this study: 


Research hypothesis? There is no effect on achievement for sex or 
method, and there is no effect for the interaction. 

Significance level? .05 
1- or 2-tailed? 2-tailed
Design 

Dependent variable(s)? Achievement 
Measurement? Scores (interval) 

Independent variable(s)? Sex, Method 

Measurement? Nominal (two levels for each independent variable)

Independent or repeated-measures? Independent 
Other features? Random selection and assignment 
Statistical procedure? Factorial ANOVA 


In One-way ANOVA, we were able to look for "treatment" effect by comparing
the two components of variance:

Total Variance
    Variance within Groups
    Variance between Groups

In Factorial ANOVA, the variance for this study will be divided in the following
way:

Total Variance
    Variance within Groups
    Variance between Groups
        Variance due to factor A (method)
        Variance due to factor B (sex)
        Variance due to A X B (interaction)


The within-group variance is ordinary "error" variability; it results from the
normal distribution of individual scores within each of the groups. However, the
between-groups variance contains within it the effect of factor A, factor B, or
factor A X B. As in One-way ANOVA, we want to compute the values of
variance for each component and then put each in an F ratio to see which, if any,
exceed the normal within-group variance.

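$F_{\text{Factor A}} = \dfrac{s^2_A}{s^2_W}$ (effect of method, factor A)

$F_{\text{Factor B}} = \dfrac{s^2_B}{s^2_W}$ (effect of sex, factor B)

$F_{\text{Factor AB}} = \dfrac{s^2_{AB}}{s^2_W}$ (effect of the interaction, factor A X B)
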

Before we go on to the calculations used in computing these F ratios (don't worry,
they are very similar to those used for One-way ANOVA), stop and chart the
partialing of the total variance in each of the following studies.

ooooooooooooooooooooooooooooooooooooo

Practice 13.1

Use the "tree diagram" given in the previous discussion as a guide. Chart the
partialing of total variance into that for each independent variable.

1. Abraham (1985) studied the differential effects of two methods for teaching
participial formation on 28 field-dependent and 28 field-independent students.
Two CAI-ESL lessons were used, one using a traditional deductive-rule approach
and the other using examples in context. Field-independent learners got higher
posttest scores (dependent variable) with instruction from the deductive lesson;
the example lesson was more beneficial for field-dependent students.


► 2. Pritchard (1987) collected 10 essays each from 383 junior and senior high
school students whose teachers either had or had not been trained in the National
Writing Project Model. A 6-point rating scale (with detailed operational
definitions for each point on the scale) was used to rate the compositions. Among
many other findings in this study, the author notes that differences were found
between students from classes taught by NWP-trained teachers vs. nontrained
teachers, but an interaction showed that this difference was really valid only at
the junior high level.


3. Do you think that the authors were pleased that the interactions turned out
to be significant? Why (not)? _


This should give you a forewarning about interpretation of the results of
factorial ANOVA. That is, when an interaction turns out to be significant, the
findings must be interpreted in light of the interaction (rather than in terms of the
individual independent variables).

ooooooooooooooooooooooooooooooooooooo


Calculating ANOVA for a 2 X 2 Design 

Researchers seldom, if ever, carry out Factorial ANOVA calculations by hand. 
However, it is something that everyone should try once (and perhaps only once) 
in order to understand how ANOVA works. 

There are six steps in the computations: 

1. Compute sum of squares total (SST). 

2. Compute sum of squares between (SSB). 

3. Compute sum of squares within (SSW). 

4. Compute sum of squares for factor A ($SS_A$).

5. Compute sum of squares for factor B ($SS_B$).

6. Compute sum of squares for the interaction of factors A and B ($SS_{AB}$).

We will use the example of method (teacher-centered vs. cooperative-learning 
instruction) and sex as independent variables with an achievement test as the 
dependent variable. There are three null hypotheses to be tested: 

1. There is no difference in achievement test scores according to method. 

2. There is no difference in achievement test scores according to sex. 

3. There is no interaction effect between sex and method on achievement. 

In order to simplify the calculations, we have limited the number of Ss to five in
each group. Again, in a real study, we recommend a larger n size for each group.
Five is an absolute minimum. Thirty per group gives us a much better population
estimate. The data given in the following table are fictitious.

                                     Sex (factor B)
Method                      Female                     Male
(factor A)               X        X²               X        X²

Teacher Centered         9        81              14       196
                        10       100              12       144
                         8        64               9        81
                         7        49               9        81
                        10       100              14       196
                      ΣX = 44  ΣX² = 394       ΣX = 58  ΣX² = 698
                      Mean = 8.8               Mean = 11.6

Cooperative Learning    16       256              11       121
                        15       225              10       100
                        20       400               9        81
                        14       196               7        49
                        12       144               9        81
                      ΣX = 77  ΣX² = 1221      ΣX = 46  ΣX² = 432
                      Mean = 15.4              Mean = 9.2

Mean for Females = 12.1      Mean for Males = 10.4
Mean Coop Lrng = 12.3        Mean for Tchr Ctr = 10.2
Grand Mean = 11.25



The $\bar{X}$s for the cells do appear to be quite different. There is a large difference,
for example, between a $\bar{X}$ of 8.8 and a $\bar{X}$ of 15.4. However, remember that the
smaller the sample size, the larger the difference among $\bar{X}$s must be in order to
achieve statistical significance. The raw $\bar{X}$s also look as though there may be an
interaction in that females in the cooperative-learning group have the higher $\bar{X}$
while the higher male $\bar{X}$ is for the teacher-centered group. To know whether this
is the case, we must carry out the statistical procedure.
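
Before starting the hand calculation, it can help to confirm the cell and marginal means. The sketch below is one way to do so in Python; the dictionary layout and variable names are our own choices, and the scores are the fictitious data from the table.

```python
from statistics import mean

# Fictitious scores from the data table, keyed by (method, sex).
scores = {
    ("teacher-centered", "female"): [9, 10, 8, 7, 10],
    ("teacher-centered", "male"): [14, 12, 9, 9, 14],
    ("cooperative", "female"): [16, 15, 20, 14, 12],
    ("cooperative", "male"): [11, 10, 9, 7, 9],
}

# Mean for each of the four cells.
cell_means = {cell: mean(values) for cell, values in scores.items()}


def marginal_mean(position, level):
    """Mean of all scores at one level of method (0) or sex (1)."""
    pooled = [x for cell, values in scores.items()
              if cell[position] == level for x in values]
    return mean(pooled)


grand_mean = mean(x for values in scores.values() for x in values)

print(cell_means)
print(marginal_mean(1, "female"), marginal_mean(1, "male"))
print(marginal_mean(0, "cooperative"), marginal_mean(0, "teacher-centered"))
print(grand_mean)
```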

Step 1 

Compute the sum of squares total (SST).


The value of $\sum X^2$ should be familiar to you by now. Square each individual score
and then sum these.

$\sum X^2 = 9^2 + 10^2 + 8^2 + \cdots + 7^2 + 9^2$
$\sum X^2 = 2745$

To find $(\sum X)^2$, first sum the individual scores and square that value.

$(\sum X)^2 = (9 + 10 + 8 + \cdots + 7 + 9)^2$
$(\sum X)^2 = 225^2$
$(\sum X)^2 = 50625$

There are five Ss in each of the cells of the design. The N, therefore, is 20. Now
that we have all the values needed for SST, we can complete the following
computations.

$SST = \sum X^2 - \dfrac{(\sum X)^2}{N}$
$SST = 2745 - \dfrac{50625}{20}$
$SST = 213.75$
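
The same SST computation can be sketched in Python as a check on the hand arithmetic. The variable names are ours; the scores are the fictitious data from the table.

```python
# Fictitious scores from the data table, one list per cell of the 2 X 2 design.
scores = {
    ("teacher-centered", "female"): [9, 10, 8, 7, 10],
    ("teacher-centered", "male"): [14, 12, 9, 9, 14],
    ("cooperative", "female"): [16, 15, 20, 14, 12],
    ("cooperative", "male"): [11, 10, 9, 7, 9],
}

all_scores = [x for cell in scores.values() for x in cell]
n_total = len(all_scores)                         # N
sum_x = sum(all_scores)                           # sum of the raw scores
sum_x_squared = sum(x ** 2 for x in all_scores)   # sum of the squared scores
correction = sum_x ** 2 / n_total                 # (sum of X) squared, over N

sst = sum_x_squared - correction
print(sum_x_squared, sum_x, sst)
```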

Step 2 

Find the sum of squares between (SSB).

The sum of squares between (SSB) contains the treatment effects (just as it did
for One-way ANOVA). It also includes the effect of sex and of the interaction
between method and sex on the test. Once we find the value of SSB, then, it will
be further partialed out so that we can see the effect of each of these individually.

Since we are working from raw data, the formula we will use for SSB is:

$SSB = \left[ \dfrac{(\sum X_1)^2}{n_1} + \dfrac{(\sum X_2)^2}{n_2} + \dfrac{(\sum X_3)^2}{n_3} + \dfrac{(\sum X_4)^2}{n_4} \right] - \dfrac{(\sum X)^2}{N}$

$SSB = \left[ \dfrac{44^2}{5} + \dfrac{58^2}{5} + \dfrac{77^2}{5} + \dfrac{46^2}{5} \right] - \dfrac{225^2}{20}$

$SSB = 137.75$
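
As a quick check, the SSB formula can be sketched in Python using only the four cell totals. The labels in the dictionary are our own shorthand for the four groups.

```python
# Cell totals (sum of X) from the data table; five Ss per cell.
cell_sums = {
    "TchrCtr female": 44,
    "TchrCtr male": 58,
    "CoopLrng female": 77,
    "CoopLrng male": 46,
}
n_per_cell = 5
n_total = n_per_cell * len(cell_sums)

grand_sum = sum(cell_sums.values())
correction = grand_sum ** 2 / n_total   # (sum of X) squared, over N

# Square each cell total, divide by its n, add them up, subtract the correction.
ssb = sum(total ** 2 / n_per_cell for total in cell_sums.values()) - correction
print(round(ssb, 2))
```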

Step 3 

Compute the sum of squares within (SSW). 


This step is easy since we already have SST and SSB. All we need to do is
subtract the between-groups variance from the total variance.

$SSW = SST - SSB$
$SSW = 213.75 - 137.75$
$SSW = 76.0$

(We hope these are the same values that you got.)

Remember that SSW is the variance associated with normal variability ("error"
variability) in performance. That is, it is not influenced by either method or sex
or the interaction.


Step 4

Compute the sum of squares for factor A, method ($SS_A$).

SSB includes all the variance due to the effects of the independent variables (i.e.,
method and sex and the interaction). We need to divide up the variance in SSB
to show us that for factor A, method. To do this, we will add up the total score
for each level of method (that for cooperative-learning and that for
teacher-centered instruction). Then we will divide each of these by the n for the level.

At the end of the upper left quadrant of the data table, you will find the $\sum X$ for
the total scores for women in the teacher-centered group. It is 44. In the upper
right quadrant is $\sum X$ for the total scores for men, 58. So $\sum X$ for everyone in the
teacher-centered group is 102.

There are 10 Ss in the cooperative-learning group. Now, let's find $\sum X$ for the
cooperative-learning instruction. That for females is 77 and for males is 46, so
the total $\sum X$ for this treatment is 123.

Now we can put this information into the formula for the sum of squares for
factor A (method).

$SS_A = \left[ \dfrac{(\sum \text{scores } A_1)^2}{n_{A1}} + \dfrac{(\sum \text{scores } A_2)^2}{n_{A2}} \right] - \dfrac{(\sum X)^2}{N}$

$SS_A = \left[ \dfrac{102^2}{10} + \dfrac{123^2}{10} \right] - \dfrac{225^2}{20}$

$SS_A = 22.05$

Now that we have the portion of SSB that belongs to method, we can find the
part for sex.


Step 5

Compute the sum of squares for factor B, sex ($SS_B$).

Now we need to partial out the part of the SSB variance for the effect of sex.
The information we need, once again, is in the data chart. We add the scores for
females in the upper left quadrant (where $\sum X$ = 44) with that of females in the
lower left quadrant (where $\sum X$ = 77). The total $\sum X$ for females is 121. The total
for males is 104. There are 10 females and 10 males in the total study. (A
balanced design; hooray for small mercies!)

$SS_B = \left[ \dfrac{(\sum \text{scores } B_1)^2}{n_{B1}} + \dfrac{(\sum \text{scores } B_2)^2}{n_{B2}} \right] - \dfrac{(\sum X)^2}{N}$

$SS_B = \left[ \dfrac{121^2}{10} + \dfrac{104^2}{10} \right] - \dfrac{225^2}{20}$

$SS_B = 14.45$
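
The sums of squares for the two main effects follow the same pattern, so a single helper can serve for both. This is a sketch only; the function name and argument names are our own.

```python
def main_effect_ss(level_sums, n_per_level, grand_sum, n_total):
    """Sum of squares for one factor, computed from the totals of its levels."""
    # Square each level total and divide by the number of Ss at that level.
    levels_part = sum(total ** 2 / n_per_level for total in level_sums)
    # Subtract the correction term, (sum of X) squared over N.
    return levels_part - grand_sum ** 2 / n_total


# Factor A (method): teacher-centered total 102, cooperative-learning total 123.
ss_method = main_effect_ss([102, 123], 10, 225, 20)
# Factor B (sex): female total 121, male total 104.
ss_sex = main_effect_ss([121, 104], 10, 225, 20)

print(round(ss_method, 2), round(ss_sex, 2))
```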


Now that we have the sum of squares for each of the main effects in the study, 
we need to compute the sum of squares for the interaction of the two. 

Step 6 

Compute the sum of squares for the interaction ($SS_{AB}$).

This will be easy since we already have the SS for factor A (method) and factor
B (sex). SSB contains the effect of both of these main factors and that of the
interaction. So, all we need to do is subtract.

$SS_{AB} = SSB - (SS_A + SS_B)$

$SS_{AB} = 137.75 - (22.05 + 14.45)$

$SS_{AB} = 101.25$


We hope that you are checking these calculations as we go along. Do the given
values of SSB and $SS_A$ and $SS_B$ agree with the values you computed? If not,
check with your study group to see where you went wrong.

Now that we have worked through each of the six steps and have found the sum
of squares for each, we need to divide them by their respective degrees of
freedom.


The total df for the study = N - 1. There were 20 Ss in the study, so the total
df = 19. The df for SSW is N - K. There were 20 Ss and 4 groups so
N - K = 16 df. For factor A, there were 2 levels, so a - 1 = 1 df for method.
For factor B, there were also 2 levels, so a - 1 = 1 df for sex. The df for the
interaction is found by multiplying the df for factor A by the df of factor B.
1 x 1 = 1 df.

You can check these degrees of freedom for accuracy by adding them to see if
they are the same as the total df for the study. 16 + 1 + 1 + 1 = 19, which is the
total df, N - 1.

Now we can divide each sum of squares by its respective degrees of freedom to
obtain the variance that can be attributed to each. These variance values are (as
before in the other ANOVAs we have computed) called the mean squares (MS).
Let's fill in the ANOVA table for the values we have obtained so far.

Source              SS        df      MS        F ratio
Between Groups      137.75     3
  Method             22.05     1      22.05
  Sex                14.45     1      14.45
  A X B             101.25     1     101.25
Within Groups        76.00    16       4.75
Total               213.75    19

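
The degrees of freedom and mean squares can be checked with a short Python sketch. The sums of squares are the values computed in steps 1 through 6; the dictionary labels are our own.

```python
n_total, n_groups = 20, 4
levels_a, levels_b = 2, 2   # levels of method and of sex

df = {
    "Method": levels_a - 1,
    "Sex": levels_b - 1,
    "A X B": (levels_a - 1) * (levels_b - 1),
    "Within Groups": n_total - n_groups,
}
ss = {"Method": 22.05, "Sex": 14.45, "A X B": 101.25, "Within Groups": 76.0}

# The component df should add up to the total df, N - 1.
df_total = sum(df.values())

# Mean square = sum of squares divided by its degrees of freedom.
mean_squares = {source: ss[source] / df[source] for source in ss}

for source, ms in mean_squares.items():
    print(f"{source:14s} SS = {ss[source]:7.2f}  df = {df[source]:2d}  MS = {ms:7.2f}")
```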


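Now we are ready to compute the F ratios for each factor and for the interaction.

$F\text{-ratio}_{A\,(Method)} = \dfrac{MS_A}{MSW} = \dfrac{22.05}{4.75} = 4.64$

$F\text{-ratio}_{B\,(Sex)} = \dfrac{MS_B}{MSW} = \dfrac{14.45}{4.75} = 3.04$

$F\text{-ratio}_{A \times B} = \dfrac{MS_{AB}}{MSW} = \dfrac{101.25}{4.75} = 21.32$

You can place each of these F ratios in the appropriate row in the column labeled
F ratio above.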

The final step in our calculations, of course, is to discover whether the obtained 
ratios are sufficiently larger than 1 to allow us to reject the H 0 . We can find 
■he critical value of Ffor 1,16 df for each of the three effects (method, sex, A X 
B) in the F distribution table, table 5 of appendix C. The intersection of 1 and 
|6 df shows that a critical value of 4.49 is needed to reject the H 0 for an a of .05. 
3ur ratio for method is larger than the required critical value of F. We can place 
an asterisk next to the method F ratio to show this. The F ratio for sex is not 
larger than the critical value, so we cannot reject the H 0 for sex. Unfortunately, 
he F ratio for the interaction of method X sex is also larger than the critical value 
)f F. This leads us to the issue of interpretation. 
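
As a quick check on these decisions, here is a minimal Python sketch that forms each F ratio and compares it with the tabled critical value. It assumes the values from our example (MSW = 4.75 and a critical F of 4.49 for 1,16 df); because each effect has 1 df, each mean square equals its sum of squares. The dictionary names are ours.

```python
ms_within = 4.75     # MSW from the ANOVA table
f_critical = 4.49    # tabled value for 1 and 16 df, alpha = .05

# Each effect has 1 df, so its mean square equals its sum of squares.
mean_squares = {"method": 22.05, "sex": 14.45, "method x sex": 101.25}

f_ratios = {name: ms / ms_within for name, ms in mean_squares.items()}
significant = {name: f > f_critical for name, f in f_ratios.items()}

for name, f in f_ratios.items():
    flag = "*" if significant[name] else ""   # asterisk marks a significant F
    print(f"{name:13s} F = {f:5.2f}{flag}")
```

The sketch takes the critical value from the printed table rather than computing it, so it stays within what the chapter itself provides.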




Interpreting the Interaction and Main Effects in Factorial ANOVA

In the above example, the interaction of method and sex is significant. This
means that the effect of method was moderated by that of sex. To see how this
works, let's chart the X̄s of our groups. The X̄ for women in the teacher-centered
approach was 8.8 and for the cooperative-learning approach, 15.4. The X̄ for
men in the teacher-centered approach was 11.6 and in the cooperative-learning
method, 9.2. These are charted on the following figure.

[Figure: group X̄s (vertical axis, about 8 to 16) plotted for males and females
(horizontal axis), with one line for cooperative learning (CoopLrng) and one for
teacher centered (TchrCtrd).]


To draw this figure we located the X̄ for cooperative learning for males and for
females and drew a line connecting these two points. This line is labeled
cooperative learning. Then we located the X̄ for males and females for the
teacher-centered method and drew a line to connect these two X̄s. It is labeled
teacher centered.

This graph shows that while women did much better if they were in the
cooperative-learning group, men showed somewhat better performance in the
teacher-centered group. Since there is such a strong interaction, we would
naturally be suspicious of any claim that said the two methods differed and that one
was better than the other overall. Rather, we must interpret the findings in light
of the interaction. The interaction overrules the main effect.
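
The pattern in the graph can also be read off the numbers. The Python sketch below, offered only as an illustration, takes the four cell X̄s from our example and computes the effect of method separately for each sex (cooperative learning minus teacher centered). The names `means` and `method_effect` are ours.

```python
# Cell means from the example.
means = {
    "female": {"teacher-centered": 8.8, "cooperative": 15.4},
    "male": {"teacher-centered": 11.6, "cooperative": 9.2},
}

# Effect of method within each sex: cooperative minus teacher-centered.
method_effect = {
    sex: round(cell["cooperative"] - cell["teacher-centered"], 1)
    for sex, cell in means.items()
}

# If method worked the same way for both sexes, the two differences would be
# about equal. Opposite signs mean the lines on the chart cross.
crossover = method_effect["female"] * method_effect["male"] < 0

print(method_effect)
print("crossover:", crossover)
```

A difference of 6.6 points in favor of cooperative learning for women, against 2.4 points in favor of the teacher-centered method for men, is what "the effect of method was moderated by sex" looks like in arithmetic.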

It is true that we almost always hope that our main effects (such as method in this
case) turn out to be significant. Then we can say there is a clear difference which
relates to the independent variable. When we add moderator variables (such as
sex in this case), we are checking to make sure nothing else may play an
important role. We may hope that this particular variable will not turn out to be
significant. However, if it is significant in the study, a chart of the means will help us
interpret the findings.



[Figure: group X̄s (vertical axis, 12 to 22) plotted for males and females
(horizontal axis), one line per method.]

In this case, our interpretation would be that the $H_0$ of no difference between the
two methods could be rejected. The methods differ significantly with Ss
performing better in the cooperative-learning group. The $H_0$ of no difference for sex
could also be rejected. Males did better than females regardless of method. The
figure shows no interaction. Therefore, the main effects can be interpreted
independently.

More important, we hope that the interaction of the main variables will not be 
significant. We want a clear-cut difference in method without any "interference" 
from the interaction with other variables. When an interaction is significant, we 
must focus our interpretation on it and not make claims about the significance 
of the main variables. 

Our interpretation of the ANOVA table in this example, then, is that any
difference found in the two methods must be attributed to the fact that women did
better on the achievement exam if they were in the cooperative-learning group.
To apply this information would be difficult. It's unlikely that we could or would
want to segregate women into a cooperative-learning class because they seemed
to do much better with this methodology. However, it is also true that the
difference between the two methods was not so pronounced for men, so perhaps we
could argue that they might not be harmed and women would be helped if they
all enrolled in classes that used cooperative learning. Obviously, the best solution
would be to vary techniques to take advantage of both methods in teaching
French I at our university. It would also be important to check to see if there
were any important differences in the Ss (despite random assignment) in the
teacher-centered and cooperative-learning classes. There might be other factors
which challenge the validity of subject selection in the study.

When we plot an interaction using the X̄s of each subgroup, it is not always the
case that the figure for the interaction will form a crossover pattern. It is possible
that a figure such as the following might be found.


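[Figure: group X̄s (vertical axis, 10 to 15) plotted for males and females
(horizontal axis), with one line for cooperative learning (CoopLrng) and one for
teacher centered (TchrCtrd); the lines do not cross.]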


Even though there is no crossover, the effect of method is still clearly due to better
performance of females in the cooperative-learning approach. The difference in
methods for males appears trivial.

It is always helpful to chart the X̄s in order to understand exactly how best to
interpret a Factorial ANOVA. Again, it is important to remember that
interaction effects must qualify any statements we make about the effect of the main
variables by themselves. The interaction effect washes out the main effect. That
is why the interpretation of the results of ANOVA must be carefully done. The
interpretation must focus on the interaction effect when it is significant. When
the interaction effects are not significant, stronger statements can be made about
the effect of the independent variables on the dependent variable.

We have said that researchers almost always hope that no significant interaction
effects are found. However, there are cases--such as in cross-cultural research--
where researchers do expect to find significant interaction effects. For example,
if you had test data where Ss from several different L1 groups judged the
appropriateness of information-request forms, you might include sex as another
major variable. In this case, you might not expect males and females to differ in
their judgment scores, but you would probably expect to find that judgments vary
not just by L1 membership (the other main effect being tested) but in interaction
with sex. This would show, for example, that male or female Ss within certain
L1 groups judge the appropriateness of requests differently than male or female
Ss in another L1 group. Again, the interpretation would be in terms of this
interaction rather than in terms of simple L1 membership.

As always, it is helpful in reading a research report to begin with the abstract and
then turn to the tables and interpret the findings. With complex designs that use
ANOVA this is an especially important step in evaluating research. Read and
interpret the tables and then turn to the full article, checking your own
understanding of the results against that of the author. ANOVA is an excellent
statistical test precisely because it allows us to look at interaction effects and to
interpret the main effects in light of the interactions. If you train yourself to do
this as you read the research reports of others, it will become much easier to offer
an interpretation of your own research findings.

We have said that it is possible to design a Factorial ANOVA with many
different factors. From our discussion of interaction effects, we hope you can see why
it is often so difficult to interpret complex designs. The more independent
variables are added into a Factorial design, the more difficult it becomes to interpret
the multiple interactions that may result. Rather than combining many
independent variables in one study, we might design a series of studies where we vary
the independent variables. As you will see in future chapters, another solution
is to use correlation-based statistical procedures to get at the importance of large
numbers of independent variables.


Strength of Relationship: omega²

As part of interpretation, we can get a rough measure of strength of relationship
by using the omega² procedure. There were two main effect variables and one
interaction in our example, so there are three possible omega² computations:

$$\omega^2_A = \frac{SS_A - (df_A)(MSW)}{SST + MSW}$$

$$\omega^2_B = \frac{SS_B - (df_B)(MSW)}{SST + MSW}$$

$$\omega^2_{AB} = \frac{SS_{AB} - (df_{AB})(MSW)}{SST + MSW}$$

However, for the methodology problem we just completed, remember that the
effects for method (A) and the interaction (AB) were significant. Sex (B) was not.
Therefore, we have no reason to do an omega² for sex.


$$\omega^2_A = \frac{22.05 - (1)4.75}{213.75 + 4.75} = .08$$

$$\omega^2_{AB} = \frac{101.25 - (1)4.75}{213.75 + 4.75} = .44$$


So, we can see that although the effect for method was significant in the
ANOVA, the strength of that relationship is not great, only 8%. The interaction
of sex X method accounts for much more of the variance, 44%. Obviously, the
interaction is much more important than method alone.
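
The two computations can be reproduced with a few lines of Python. This is a minimal sketch using the values from our example (SST = 213.75, MSW = 4.75); the function name `omega_squared` is ours.

```python
def omega_squared(ss_effect, df_effect, ms_within, ss_total):
    """Strength of relationship for one effect in a balanced design."""
    return (ss_effect - df_effect * ms_within) / (ss_total + ms_within)


ss_total = 213.75
ms_within = 4.75

omega_a = omega_squared(22.05, 1, ms_within, ss_total)     # method
omega_ab = omega_squared(101.25, 1, ms_within, ss_total)   # method x sex

print(round(omega_a, 2), round(omega_ab, 2))
```

Keeping the formula in one function makes it plain that only the numerator changes from one effect to the next; the denominator, SST + MSW, is the same for every effect in the design.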

Using omega² helps us to understand why it is so important to interpret ANOVA
with reference to any significant interactions. It also lets us know how much
variance is left unaccounted for. Together, method (8%) and the interaction (44%)
cover about half of the variance (52%), but that still leaves about 48%--the result
of "error"--random variation that is not accounted for in the design. This suggests
we might try to discover what factors other than sex (perhaps learning styles)
might interact with these particular teaching methodologies.

Remember that omega² is used for balanced designs. If the design is
unbalanced, use the eta² formula presented in chapter 12 ($\eta^2 = SSB \div SST$) as an
estimate of the strength of the relationship.


Assumptions Underlying Factorial ANOVA 

The assumptions underlying a regular between-groups Factorial ANOVA are 
those of all ANOVA procedures. 

1. The data are score or rank-order data that are truly continuous.

2. The data are independent. If the design includes repeated-measures or
within-group comparisons, there are special ANOVA procedures available in
SAS, SPSS-X, and BMDP that can be used instead.

3. The distributions of scores in the samples are normal enough that X̄ and
variance are appropriate measures of central tendency and variability.

4. The data form a normal distribution with equal variances in the respective
populations.

5. The design is balanced (otherwise use general linear modeling
procedures--GLM in computer statistics packages).

6. There are at least five observations per cell.

7. The F statistic allows the researcher only to accept or reject the $H_0$ (other
statistical procedures can be used to locate differences among levels of those
variables where the F statistic is statistically significant).

For a thorough discussion of Factorial ANOVA and the assumptions underlying
the procedures, see Kirk (1982) or Winer (1971).


ooooooooooooooooooooooooooooooooooooo 

Practice 13.2 

► 1. Chapter 11, practice 11.3, presented a problem about determining the best
method for teaching composition. In that study, three methods were devised for
teaching composition. Ss were randomly assigned to one of the three methods.
All Ss had the same topics for their compositions. At the end of the semester, the
final compositions were graded by five raters and the scores were averaged.
Imagine the ESL director wanted to see if the same results hold for the remedial
ESL composition class as well. The methods and topics were as they were in the
regular classes. The final composition results from three classes to which the
remedial Ss were assigned were:


Method 1    Method 2    Method 3
   12          19          13
   14          20          15
   10          10           9
   13          19          15
   11          12          10
   12          14          13
   15          18          14
   10          21          12
   12          19          13
   14          16          14

For the regular ESL composition classes:

          Method 1    Method 2    Method 3
ΣX          207         149         179
X̄           20.7        14.9        17.9
ΣX²         4351        2285        3291
n            10          10          10



Make class be factor A, and method be factor B.

a. Do a 3 X 2 Factorial ANOVA for these data. Note that for some of the steps
you will have more factors in the equation. SSB will have 6 expressions in the
brackets instead of 4. Likewise, $SS_B$ in step 5 will have 3 expressions in brackets
instead of 2. The df for factor B (method) will be 2 instead of 1, right?

Step 1: SST = _____

Step 2: SSB = _____

Step 3: SSW = _____

Step 4: $SS_A$ = _____

Step 5: $SS_B$ = _____

Step 6: $SS_{AB}$ = _____

F ratio for A = _____

F ratio for B = _____

F ratio for A X B = _____


b. Which null hypotheses can be rejected? 


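c. Calculate omega² for significant F ratios.

$$\omega^2_A = \frac{SS_A - (df_A)(MSW)}{SST + MSW}$$

$\omega^2_A$ = _____

$$\omega^2_B = \frac{SS_B - (df_B)(MSW)}{SST + MSW}$$

$\omega^2_B$ = _____

$$\omega^2_{AB} = \frac{SS_{AB} - (df_{AB})(MSW)}{SST + MSW}$$

$\omega^2_{AB}$ = _____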

d. Draw a graph of the interaction and interpret it. You can make either method
or class the X-axis.


e. What conclusions can you draw? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Selecting Factorial ANOVA Programs 

The kind of Factorial ANOVA you use to test differences among X̄s of groups
will depend on a number of considerations. These include:

1. Is the design for fixed effects or random?

2. Is the design a between-groups design?

3. Is the design a repeated-measures design?

4. Is it a mixed design where some variables are between-groups and
others are within-group comparisons?

5. Is it a balanced or unbalanced design?

The exact type of ANOVA procedure you use must be specified in research
reports. Obviously, we do not expect the researcher to carry out complex mixed
designs by hand. There are, however, a wide range of possibilities open to you
using SAS, SPSS-X, or BMDP programs. In SAS, we recommend that you use
GLM (General Linear Modeling) for both one-way and factorial designs. In
SAS, ANOVA only handles balanced designs while GLM does the same work for
both balanced and unbalanced designs. Since SAS won't give you an error
message if you run an unbalanced design on ANOVA, it's better to play safe and use
GLM.

GLM carries out a variety of ANOVA procedures. In research reports, you may
see abbreviations that look much like ANOVA. One GLM subcommand,
MANOVA, may be used. This stands for "multivariate analysis of variance" and
is used when the research design includes more than one dependent variable. For
example, the research might include two measures from different tests.

Another abbreviation you may see in the literature is ANCOVA. This stands for
"analysis of covariance." This procedure allows you to control for some variable
factor--perhaps a pretest score or a language aptitude score--so that the
measurement of the dependent variable is adjusted taking into account this initial
difference among Ss.

We haven't presented formulas for calculating ANOVA in these complex designs,
but they can be found in Kirk (1982). Readable explanations of these complex 
designs are given in the SPSS-X manual (1986). In addition, a nonparametric 
Factorial ANOVA procedure is covered in Meddis (1984). In order to help you 
when you must decide among these procedures, you need to be able to identify 
variables and know which are repeated-measures (within-group) and which are 
between-groups comparisons. The activities section at the end of this chapter 
gives you additional opportunity to practice this decision-making process. 

In part III, we have presented statistical tests that compare the performance of
groups. The statistical procedures which allow us to make the comparisons are
selected according to the number of groups (two, or more than two), the
independence of the data (between-groups or repeated-measures designs), and the
type of measurement used (continuous data that is normally distributed so that
the X̄ and s.d. are the best measures of central tendency and dispersion, or
continuous data where the median and rank order are more appropriate). As a guide
to selection of an appropriate statistical test, turn to the flow chart on page 543.
You will find the procedures presented in part III grouped immediately below the
entry for Guttman (Implicational scaling). The flow chart is a quick guide. The
assumptions for each test must also be consulted. To do this, check the
assumptions and solutions for the procedure in the appropriate section of the review
beginning on page 548. The flow chart and the information contained in the review
of assumptions of tests are not meant to supplant the information given in each
chapter. Once a procedure is selected, it will always be a good idea to go back
and review the chapter in which it is presented. That is, these charts are
supplements or quick guides, rather than the final word on options and assumptions.

You will notice that each procedure in part III has compared performance of two
or more groups on some dependent variable. The basic question answered has
been whether there is variation in performance that can be traced to differences
in the groups identified by the independent variable(s). In part IV, we will look
at relationships among variables rather than the effect of independent variables
on a dependent variable. Instead of comparing group means (or medians), the
statistical tests, for the most part, will examine how well the data for one variable
relate or correlate with the data for other variable(s).

Activities


Read the brief descriptions of the following studies, all of which used some type
of Factorial ANOVA in data analysis. First, draw a design box for each study.
Then determine the type of ANOVA (number of dependent and independent
variables; between-groups or repeated-measures, etc.). From such a brief
description it is difficult to tell whether the basic assumptions of ANOVA have been
met. However, try to decide whether you think this is the case. Consider yourself
as the researcher and comment on changes you might want to make to simplify
(or further complicate) the research. In cases where you are not certain whether
the basic assumptions of ANOVA have been met, suggest alternative procedures
if you can.

1. E. Henly & A. Sheldon (1986. Duration and context effects on the perception
of English /r/ and /l/: a comparison of Cantonese and Japanese speakers.
Language Learning, 36, 4, 505-521.) used a procedure where five native speakers of
Cantonese were given identification tests consisting of words containing English
/r/ and /l/ in four positions: initial, medial, consonant cluster, and final. Results
showed that accuracy of perception depended on location of the sound
(perception was best in initial and medial position), not the duration of the sound. This
finding did not support the Duration Hypothesis, which was based on Japanese
speakers finding final position easiest to perceive. In discussing their findings, the
authors hypothesized that these differences might be due to the status of these
phonemic contrasts (or lack thereof) in Cantonese and Japanese.

2. J. Vaid (1987. Visual field asymmetries for rhyme and syntactic category
judgments in monolinguals and fluent early and late bilinguals. Brain and
Language, 30, 263-277.) conducted another study to support the notion that most
language tasks require preferential use of the left hemisphere of the brain. In this
study, Ss heard a word and then had to identify a different visually projected
word as a ± rhyme. When the projected word was flashed in the right visual field
(and thus processed in the left hemisphere), both monolingual and bilingual Ss
responded more quickly than when the word was presented to the left visual field.
Second, Ss heard a word and then a sentence containing that word. This was
followed by a different word flashed in the right or left visual field. Ss had to
decide whether the word was the same part of speech as in the heard stimuli.
Again, material flashed to the right visual field was responded to more rapidly
by monolinguals and bilinguals. In both tasks, late bilinguals responded more
slowly than early bilinguals.

3. J. Reid (1987. The learning style preferences of ESL students. TESOL
Quarterly, 21, 1, 87-111.) conducted a survey of 1,388 students to discover their
perceptual learning style preferences. Subjects from nine language backgrounds
indicated their attitudes about statements which reflected six learning styles:
visual, auditory, kinesthetic, tactile, group learning, and individual learning.
Results show that ESL learners strongly preferred kinesthetic and tactile learning
styles, with a negative preference for group learning. Japanese speakers were
most frequently different in their preferences.

4. R. Scarcella (1984. How writers orient their readers in expository essays: a
comparative study of native and non-native English writers. TESOL Quarterly,
18, 4, 671-688.) analyzed 110 essays (30 native English and 80 ESL) for
orientation, the preparatory material before the statement of the theme. Nonnative
speakers wrote significantly longer orientations than native speakers, but the
longer orientations tended to contain more "known," and thus unnecessary
information. Japanese speakers wrote significantly shorter orientations than the other
three first language groups studied, but they wrote lengthy orientations following
the theme.

5. R. J. Vann, D. E. Meyer, & F. O. Lorenz (1984. Error gravity: a study of
faculty opinion of ESL errors. TESOL Quarterly, 18, 3, 427-440.) asked 164
faculty members to rate the relative seriousness of 12 common ESL written
errors. Their judgments generated a hierarchy of errors, with word order errors
being the least acceptable and spelling errors being the most acceptable. The age
and academic field of respondents appeared to be important factors in responses.

6. W. K. Tsang (1987. Text modifications in ESL reading comprehension. RELC
Journal, 18, 2, 31-44.) undertook a study to examine the effects of text version
(native speaker, input modified, interactional structure modified) and form level
(forms 3-7, roughly equivalent to grades 9-13) on reading comprehension. Four
hundred and one Cantonese speakers were randomly placed in one of three
groups representing the three kinds of text versions. Ss then read the text version
assigned to their group; all Ss answered the same multiple-choice comprehension
questions after reading the passage.

A two-way ANOVA showed significant effects for the two main effects and the
interaction. Post hoc comparisons show that the modified texts were more
effective in fostering comprehension than the unmodified native speaker versions,
while the modified input was more effective than the interactionally modified
version. Input modified texts were better for Ss in form 3 while input and
interactionally modified ones were best for form 4 Ss.

The following table for strength of effect is also presented: 


omega² Analysis

Source                        ω²       % of variance
Text version                 0.055          5.5
Form level                   0.287         28.7
Interaction (text X form)    0.049          4.9
Error                        0.609         60.9
Total                        1.000        100%

In light of the ω² figures, which is the most important effect? What does this
mean? How much variability has yet to be explained?

7. E. Shohamy (1987. Does the testing method make a difference? The case of
reading comprehension. Language Testing, 1, 2, 147-170.) studied the effects of
various methods of testing reading comprehension. Six hundred fifty-five Ss were
randomly assigned to one of eight groups representing various combinations of
three independent variables: (1) question type--multiple-choice or open-ended; (2)
questions in L1 (Hebrew) or L2 (English); and (3) version of the text (2 topics).
All Ss read a text in English (the L2) and answered questions based on the text.
Results of a Factorial ANOVA showed significant effects for the independent
variables. The addition of a fourth variable, proficiency level, showed that effects
were more pronounced in the lower-level Ss.

Chi-square Procedure

Nominal data are facts that can be sorted into categories such as L1 background
(sorted into Mandarin, Farsi, Korean, etc.), sex (sorted as male, female), or ESL
course level (sorted as beginning, intermediate, advanced). The category may be
represented by a number, but the number is arbitrary (1 may equal Mandarin,
2 Farsi, and so forth). Nominal variables are discrete--one either is or is not a
native speaker of Mandarin. Nominal variables are not meant to handle the
subtleties of degree. Rather, they are measured as frequencies.

In part I of this manual we talked about how we can display frequency data in
terms of proportions, percents, rates, or ratios and how these might be shown in
various types of figures. The frequencies might be the number of beginning,
intermediate, and advanced students who used special listening comprehension
materials in a language lab. (We want to know if there is a relation between lab
use and student level.) They might represent the frequencies of special hedges
(such as perhaps, appears, seems) in science vs. economics textbooks. (We want
to know if there is a relation between type of hedge and text genre.) The
frequencies could show a comparison between the number of people who answered
"yes" or "no" on a questionnaire (where we want to know if response type and sex
are related).

When we look at frequencies in a display table (e.g., the number of men vs.
women who voted "yes" or "no"), they "look" similar or different. We turn to a
statistical test, though, to tell us how different frequencies have to be before we
can make claims about the relation of the variables with some degree of certainty.


• Chi-square procedure
    Computing χ² for one-way designs
    Chi-square for two-way designs
• Yates' correction factor
• Assumptions underlying Chi-square
• Interpreting χ²
• Strength of association
    Phi
    Cramer's V
• McNemar's test for related samples

For frequency data, an appropriate statistical procedure to test the relationship
is the Chi-square (χ²) test. Notice that it tests the relationship between the
variables (how well they go together) rather than how one variable affects another.
The Chi-square procedure does not allow us to make cause → effect claims.

Assume you were about to design materials to teach relative clauses and you
wanted to know whether the "noun accessibility hierarchy" is correct. Actually,
you don't care about the more esoteric types shown in the hierarchy but only
relative clauses that follow subjects (e.g., The team that wins the series will
advance to the finals), objects (e.g., I always like the team that wins the game), or
the object of a preposition (e.g., The series tickets are in the envelope that's on the
table) since these are the three types that must be sequenced for teaching
purposes.

As a first step, you decide to use the Brown corpus as a data base since it is
available for computer analysis. You randomly sample 10,000 sentences from the
total corpus. Then, to find the relative clauses in these sentences, you use relative
pronouns (that, who, which, etc.) as prompts. You sort through the examples
that the computer lists to get rid of examples where the prompt word is not a
relative pronoun (for example, where who is a question word rather than a
relative pronoun). Then you categorize the relative clauses by position. The table
showing the frequencies might look like this:


SUBJ NP OBJ NP PrepPH NP 
REL CL 442 334 94 

From these frequencies, it is obvious that there are more relative clauses following
subjects and objects than following prepositions. Numerically, the frequencies in
these cells differ. Despite our confidence that this is so, we will perform a test to
be certain of this relationship between position and frequency.


Research hypothesis? There is no relation between position and
    number of relative clauses.
Significance level? .05
1- or 2-tailed? Always 2-tailed for this procedure
Design
    Dependent variable(s)? Relative clauses
        Measurement? Frequency tally
    Independent variable(s)? Position
        Measurement? Nominal--3 levels (following a subject, object, or
        prepositional phrase)
    Independent or repeated-measures? Independent (Chi-square
        requires that the data from each subject or piece of text appear in
        one and only one cell. The comparisons here are between cells.)
    Other features? Data source: randomly selected sentences from a
        large data base thought to be representative of written text.
Statistical procedure? Chi-square test
Box diagram: 1 X 3 design (1 level of dependent; 3 of independent)

             SUBJ     OBJ    PrepPH
REL.CL.    |      |      |        |



Computing χ² for One-way Designs

The first step in the Chi-square procedure is to prepare a frequency chart such
as the one below. As before, in a one-way design, the dependent variable is the
row and the independent variable is shown in the columns. The design is
one-way--that is, we are investigating the relationship of the levels of one independent
variable and the dependent variable. There is only one row of cells in the design.

             SUBJ     OBJ    PrepPH
REL.CL.    | 442  | 334  |   94   |


The second step is to decide what the frequencies would have been if there were
no relationship between position and number of relative clauses. If the number
of relative clauses were the same, we would expect that each position would have
the same frequency (1/3 for subject, 1/3 for object, and 1/3 for prepositional
phrases). There are a total of 870 relative clauses, so there should be
approximately 290 in each position if the $H_0$ is correct.


Obs. f Exp. f 
SUBJ 442 290 

OBJ 334 290 

PrepPH 94 290 


Looking at the actual distribution of relative clauses, the frequencies seem quite
different from this "expected" distribution. But, is the difference great enough
that we can feel confident in making claims that a difference exists?
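
The expected frequencies for a one-way design can be sketched in a few lines of Python. The sketch assumes, as the text does, that under the $H_0$ every position is equally likely; the dictionary names are ours.

```python
# Observed frequencies of relative clauses by position.
observed = {"SUBJ": 442, "OBJ": 334, "PrepPH": 94}

total = sum(observed.values())

# Under H0 each of the k positions is equally likely, so each expects total / k.
expected = {position: total / len(observed) for position in observed}

print(total)
print(expected)
```

Note that the expected frequencies always sum to the same total as the observed frequencies; only the way the total is spread over the cells differs.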

The third step is to find how far the number of relative clauses in each cell
departs from this expected value. That is, we subtract the "expected frequencies"
(E) from the "observed frequencies" (O). These values have been filled in on the
following table:



           Obs. f    Exp. f    O - E
SUBJ        442       290      +152
OBJ         334       290       +44
PrepPH       94       290      -196


You can probably guess what we do next. We square these differences from the 
expected values, and these values become the column of differences squared. 

           Obs. f    Exp. f    O - E    (O - E)²
SUBJ        442       290      +152      23104
OBJ         334       290       +44       1936
PrepPH       94       290      -196      38416

In the formulas we've worked with so far, the next step would be to sum these
values and find the average. Chi-square is different in this regard. It weights the
difference in each cell according to the expected frequency for that cell. This
means it gives more weight to those categories where the mismatch is greatest.
To do this, we divide this squared difference by the expected value.



          Obs. f   Exp. f   O - E   (O - E)²   (O - E)² ÷ E
SUBJ       442      290     +152     23104        79.67
OBJ        334      290      +44      1936         6.68
PrepPH      94      290     -196     38416       132.47


As you can see, we are comparing a table of frequency values with a table of expected
frequencies (where everything is equal as it would be if there were no relationship
between the two variables). We compare these frequencies to discover
how far each cell in the frequency table differs from its respective cell in the table
of expected values. The difference for each cell is then squared. Next, we divide
the squared difference for each cell by the expected value for the cell and sum these.
We can see that the weighted differences in the (O - E)² ÷ E column are greatest
for the prepositional phrase. There are many fewer than we would expect. The
next strongest weight is given to subject. There are more than expected if the
distribution of relative clauses was not affected by the type of noun phrase.
Then, we add the values in the final column (79.67 + 6.68 + 132.47 = 218.82).

218.82 is the observed value of χ².

The formula that captures the steps we have gone through is:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
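The steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a routine from any statistics package: the function name `one_way_chi_square` is our own, and the sketch assumes, as in the example, that the H0 predicts equal frequencies in every cell.

```python
def one_way_chi_square(observed):
    """Return (expected frequency, chi-square) for a one-way design.

    Assumes the null hypothesis predicts the same frequency in every cell.
    """
    total = sum(observed.values())
    expected = total / len(observed)  # 870 / 3 in the relative clause example
    chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
    return expected, chi_square


relative_clauses = {"SUBJ": 442, "OBJ": 334, "PrepPH": 94}
expected, chi_square = one_way_chi_square(relative_clauses)
print(expected, round(chi_square, 2))
```

Because the sketch does not round each cell value before summing, its total may differ from the hand calculation in the second decimal place.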

What does this χ² value tell us about confidence in our finding? To explain this,
we must turn again to the notion of distribution and probability. 

In our example, we know that if there were no relationship between position and
frequency of relative clauses, each position would show an equal number of relative
clauses. Each position should have the same opportunity (the same probability)
to exhibit relative clauses as every other position.

This expected distribution is, then, something like our normal distribution. We 
believe that the frequencies in each cell will be equally distributed if the levels of 
the independent variable are the same. The distribution may not be exactly that 
of the expected distribution since there is always "error." However, if the null 
hypothesis is correct, whatever variation there is will have nothing to do with the 
position of the noun phrase. 





The question is how much variation from this expected distribution is normal,
and how different should the cell frequencies be in order for us to conclude that
position is related to frequency of relative clauses? The χ² value tells us the
magnitude of the variation in cell frequencies from the expected distribution.
And, fortunately for us, mathematicians have figured out the probabilities for the
distribution of χ² values for all the possible number of levels of dependent and
independent variables. They are presented in the χ² table in appendix C (table
7).

Our χ² value is 218.82. Turn to appendix C and find table 7, labeled "χ² Distribution."
In the first column of the table you will see the numbers for the degrees
of freedom. The first value listed for χ² with 2 df in this table is 4.605. The
probability of getting a value this large is .10 (one chance in ten) if the levels
(positions in this case) are the same. The next column for 2 df shows a value of 5.99,
and the probability of getting a value this large is .05 (one chance in twenty).
As you continue across the table, you see that the χ² values
are getting larger; the difference from the expected value is getting larger too.
The probability, the chances of getting the value if the levels are the same, begins
to drop. The higher the χ² value, the lower the probability that the levels are the
same (that is, the easier it will be to reject the null hypothesis).

When we get to the very last column, we see that with a χ² value of 13.816, the
chances are indeed slim that the levels (positions) are the same. The probability
level has sunk to .001.

This means that there is only 1 chance in 1,000 that you could get a value as high
as 13.816 if the number of relative clauses were the same in each position. It's
very unlikely that you would get a result this far from the expected values if the
levels were the same. Another way of saying this is that there is 1 chance in 1,000
that you would be wrong if you said that the number of relative clauses (in these
data) differs in the three positions. Or, there are 999 chances in 1,000 that you
would be wrong if you said that there was no difference in the number of relative
clauses in the three positions. Our χ² value is 218.82, and that's so high that the
probability isn't even on the page! Perhaps 1 chance in 1,000,000. We can feel
great confidence in rejecting the null hypothesis. The number of relative clauses
is definitely related to position in this text data base.
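For the special case of 2 df, the table values can be checked without a table, because the probability of a χ² value this large or larger has the simple closed form e^(-χ²/2). The sketch below relies on that standard result; the function name is hypothetical, and the formula applies to 2 df only.

```python
import math


def chi_square_tail_2df(chi_square):
    """Probability of a chi-square value this large or larger when df = 2."""
    return math.exp(-chi_square / 2)


# Values taken from the 2 df row of the table, plus our observed value.
for value in (4.605, 5.99, 13.816, 218.82):
    print(value, chi_square_tail_2df(value))
```

The first three results should come out close to .10, .05, and .001, matching the table. The result for 218.82 is far smaller than 1 chance in 1,000,000, so the guess in the text is, if anything, too cautious.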

If you use this example as a guide, there is one thing to keep in mind: the sticky 
issue of independence. To assure independence, we randomly selected sentences 
from a very large data base. Assume, instead, that we had analyzed the relative 
clauses in one short story-a typical repeated-measures design where the same 
person produced all the relative clauses. The data would not be independent 
across cells and a Chi-square analysis should not be done. To avoid this, you 
might decide to analyze the relative clauses in science texts, a subcategory of the 
Brown corpus. Each passage is produced by a different writer. Again, there is 
a problem of independence, this time within the cells. It is possible that one 
passage by one author would contribute a very large number of relative clauses 
and this might radically influence the measure of relationship. In this case, the
data within cells would not be independent of author. One author, because of
personal writing style, might unfairly influence the outcome. (To avoid this
problem in our example, we randomly selected sentences from the total data base
and hoped to obtain no more than one sentence per author.)

The example we have used is unusual in that it is a one-way design. In fact, we 
very rarely have one-way Chi-square designs in applied linguistics research. 
Following the practice, we will consider more complex designs. 

0000000000000000000000000000000000000

Practice 14.1 

1. If there were just two relative clause positions in the example (using random
sampling techniques), how many of each would we expect to have if the total
number of relative clauses in the example (using a random sample) was 870?
_If there were five types?_

If we have no special information that tells us otherwise, our best estimate of the
number of relative clauses for each can be calculated by


2. Turn to the probability chart for χ² values. If the χ² value for our example
were larger than 9.21 but less than 13.816, the probability level would be .01.

This means that you would have 1 chance in 100 of being wrong if you rejected
the null hypothesis. How many chances of being right would you have?


If your χ² value were 6.1, how many chances would you have of being wrong if
you rejected the null hypothesis?_.

3. Imagine that you wanted to compare the frequency of other types of relative
clauses as a test of the NP (noun phrase) accessibility hierarchy. Keenan and
Comrie (1979) claimed that there is a hierarchy in terms of which noun phrases
can be most easily relativized. The hierarchy is based on the types of relative
clauses that appear across languages. Those most accessible should be those that
appear in most languages. The less accessible the NP, the fewer the number of
languages that should allow relative clauses for the NP. (Predictions are similar
to those tested with a Guttman scalogram, right? That is, if a language allows
relative clauses with indirect objects, it should also allow them with direct objects
and subjects.) We should, then, be able to use the hierarchy to determine which
types of relative clauses should be most difficult for second language learners. It
is also possible that those highest in the hierarchy would also be those most frequently
found in texts and those lowest in the hierarchy would have the lowest
frequency. The hierarchy predicts the following order: subject NP, direct object
NP, indirect object NP, oblique object NP (i.e., object of preposition), genitive
NP (i.e., possessive), and object NP of a comparison. The degrees of freedom for
positions of relative clauses is now_. Suppose that the χ² value for
the study, using a random sample of 10,000 sentences, was 10.276. The null hypothesis
is that there is no difference in the frequency of relative clauses across
these six types. That is, position is not related to the frequency of relative clauses.
What is the probability that you would be wrong if you rejected the null hypothesis?_
Would you reject the null hypothesis? Why (not)?_


For practice, you might want to recalculate the values for the expected cell frequencies
with six types, add fictitious data counts, and recompute the value of
χ² for relative clauses.

0000000000000000000000000000000000000

Chi-square for Two-way Designs 

In our examples so far, we have used Chi-square to check to see if frequencies on
one variable (the dependent variable) change with levels of another independent
variable. Let's turn now to an example which compares the relation of frequencies
for two variables, both of which have several levels. Since levels of two variables
are compared, this will be a two-way design.

Imagine your school district has entered a national campaign to encourage reading
in the elementary schools. The school district wants to use this opportunity
to study the relation of type of reward to increased reading. Students can select
a material reward (stickers, chips, etc.), peer recognition (honor list with stars,
videolog star of the week in reading), or special privileges (computer use,
videorecording, etc.). Since there are children from several different ethnic
groups, the school also wonders if this might have any relation with choice of reward.
The children report the number of books they read each month and select
whatever reward they wish.


Research hypothesis? There is no relation between ethnicity and 
choice of reward. 

Significance level? .05 

1- or 2-tailed? Always 2-tailed for this procedure 
Design 

Dependent variable(s)? Reward choice 
Measurement? Frequency tally 
Independent variable(s)? Ethnicity 

Measurement? Nominal-3 levels (Vietnamese, Mexican- 
American, Chinese) 

Independent or repeated-measures? Independent 
Other features? Intact class 
Statistical procedure? Chi-square
Box diagram:


Vietnamese Mex-Am Chinese 

Material 

Recognition 

Privilege 


Here are fictitious data on three groups of fourth grade children: 

Reward Type    Vietnamese   Mex-Am   Chinese
Material            2          21       27
Recognition         8          17        7
Privilege          26          11        6

The above table gives us our observed values for each cell. To discover the expected
values, we can't just divide the total frequencies this time by the number
of cells. The expected values depend on the number of children in each group
and the number of children in each group selecting each reward type.

First, we total the frequencies across the rows and enter this as a row total:


Reward         Vietnamese   Mex-Am    Chinese    Total
Material            2          21        27      n1 = 50
Recognition         8          17         7      n2 = 32
Privilege          26          11         6      n3 = 43
Total           n1 = 36     n2 = 49   n3 = 40    N = 125


At the bottom of each column are the number of Vietnamese responses (n1), and
so forth. If we sum the row frequencies, we can enter this under N in the bottom
right-hand corner of the table. If we add the column frequencies, we should get
the same value for N. These column totals and row totals are called the
marginals.

To find the expected value for each cell, multiply the row total by the column 
total and divide this by N: 


$$E_{ij} = \frac{n_i \times n_j}{N}$$

where E = expected frequency, i = row, and j = column.


Let's do this for the first cell in the upper left corner (for Vietnamese selecting
a material reward). The row total for material is 50. This equals n_i. The column
total for Vietnamese is 36. This equals n_j. Multiplying n_i and n_j gives us 1,800.
When we divide this by N (the total frequency for all groups on all rewards), the
answer is 14.4. We place this number in the boxes drawn for the expected values
below.

               Vietnamese   Mex-Am   Chinese   Total
Material          14.40                          50
Recognition                                      32
Privilege                                        43
Total              36          49       40      125


Let's go to the next cell, the number of Mexican-American children selecting a
material reward. The row total is again 50, but the column total this time is 49.
We multiply these and divide by N and insert the number in the corresponding
cell below. Then we repeat the operation for the next cell, the number of Chinese
students selecting a material reward.

When we move to the next row, the row total has now changed. The row total
for recognition is 32. When we multiply this by the first column total and divide
by N, we have the expected frequency for the number of Vietnamese children selecting
a recognition reward (9.22). We continue on until all the cells are filled
for the expected frequency table.


               Vietnamese   Mex-Am   Chinese
Material          14.40      19.60    16.00
Recognition        9.22      12.54    10.24
Privilege         12.38      16.86    13.76
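The expected frequency table can also be filled in mechanically from the marginals. The following is a minimal sketch of that step using the fictitious reward data; the function and variable names are our own and are not part of any statistics package.

```python
# Observed frequencies from the fictitious reward study (rows = reward type).
observed = {
    "Material":    {"Vietnamese": 2,  "Mex-Am": 21, "Chinese": 27},
    "Recognition": {"Vietnamese": 8,  "Mex-Am": 17, "Chinese": 7},
    "Privilege":   {"Vietnamese": 26, "Mex-Am": 11, "Chinese": 6},
}


def expected_frequencies(table):
    """Expected value for each cell = row total x column total / N."""
    columns = list(next(iter(table.values())))
    row_totals = {row: sum(cells.values()) for row, cells in table.items()}
    col_totals = {col: sum(table[row][col] for row in table) for col in columns}
    n = sum(row_totals.values())
    return {
        row: {col: row_totals[row] * col_totals[col] / n for col in columns}
        for row in table
    }


expected = expected_frequencies(observed)
for row, cells in expected.items():
    print(row, [round(value, 2) for value in cells.values()])
```

Rounded to two decimal places, the printed values should agree with the table of expected frequencies worked out by hand.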


Now we can find the differences between the observed frequencies and the expected
frequencies in each cell. These are shown in the O - E column in the table
below. The next step is to square each of these values and enter the results in the
column labeled (O - E)². Then we divide each (O - E)² value by the expected
value for that cell and enter the answers in the last column of the table. Finally,
we can total the values in this last column to obtain the value of χ².


Row   Col   Obs      E       O - E    (O - E)²   (O - E)² ÷ E
 1     1      2    14.40    -12.40     153.76       10.68
 1     2     21    19.60      1.40       1.96         .10
 1     3     27    16.00     11.00     121.00        7.56
 2     1      8     9.22     -1.22       1.49         .16
 2     2     17    12.54      4.46      19.89        1.59
 2     3      7    10.24     -3.24      10.50        1.03
 3     1     26    12.38     13.62     185.50       14.98
 3     2     11    16.86     -5.86      34.34        2.04
 3     3      6    13.76     -7.76      60.22        4.38

                                                    42.52
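The whole table can be reproduced with a short routine. This is a sketch under our own naming (`two_way_chi_square` is hypothetical); it keeps the unrounded expected values, so its total may differ slightly from the hand calculation, which rounds each expected value to two decimal places first.

```python
# Observed frequencies: rows = reward type, columns = Vietnamese, Mex-Am, Chinese.
observed = [
    [2, 21, 27],   # Material
    [8, 17, 7],    # Recognition
    [26, 11, 6],   # Privilege
]


def two_way_chi_square(table):
    """Return chi-square and the weighted contribution of each (row, col) cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    contributions = []
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            weight = (obs - exp) ** 2 / exp
            contributions.append(((i + 1, j + 1), weight))
            chi_square += weight
    return chi_square, contributions


chi_square, contributions = two_way_chi_square(observed)
# The two cells with the largest weightings, labeled (row, column).
largest = sorted(contributions, key=lambda item: item[1], reverse=True)[:2]
print(round(chi_square, 2), largest)
```

Sorting the contributions is a convenient way to see which cells depart most from the expected values.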


Consulting the (O - E)² ÷ E column, we see that the largest weightings are for
cell 1-1 and cell 3-1. These relate to the Vietnamese group. Cell 1-1 shows they
chose fewer material rewards than expected and (cell 3-1) more privilege rewards.
Cells 1-3 and 3-3 relate to the Chinese group, which chose more material rewards
and fewer privilege rewards than expected.

The next step is to find out whether we can or cannot reject the null hypothesis.
That is, are these weightings strong enough that we can say that the three groups
do differ in their choice of rewards? You'll remember that the χ² value has to
be placed in the appropriate Chi-square distribution. To do that, we need to
know the degrees of freedom. We can't just subtract K - 1 for groups. We also
have to subtract K - 1 for type of reward. In the study there are three groups
of Ss. So the degrees of freedom for groups is 3 - 1 = 2 (that's the hard part!).

There are three types of reward. So the degrees of freedom for type of reward is
3 - 1 = 2. To find the total degrees of freedom, we multiply these together.
Perhaps an easier way to remember how to do this is to take the number of rows
minus 1 and the number of columns minus 1 and multiply them to find the degrees
of freedom; i.e., (3 - 1)(3 - 1) = 4. The Chi-square distribution in which
we will locate our χ² value is that which mathematicians have determined for 4
df.

Now we turn to table 7 in the Appendix and find the row marked for 4 df and
look across the table until we find the location of the χ² value needed for the .05
level. The critical value listed is 9.49. Since we selected the .05 level, we can
feel confident in rejecting the null hypothesis because our χ² of 42.52 is larger
than 9.49.
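The degrees of freedom and the table lookup can be sketched as follows. When the df is an even number, the probability of a χ² value this large or larger has a closed form, which the sketch uses in place of the printed table; the function names are hypothetical, and the routine deliberately refuses odd df, where this shortcut does not apply.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) x (columns - 1) for a two-way design."""
    return (n_rows - 1) * (n_cols - 1)


def chi_square_tail_even_df(chi_square, df):
    """Probability of a chi-square value this large or larger (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles even, positive df")
    half = chi_square / 2
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series


df = degrees_of_freedom(3, 3)
p_at_critical = chi_square_tail_even_df(9.49, df)   # should be close to .05
p_observed = chi_square_tail_even_df(42.52, df)
reject_h0 = p_observed < 0.05
print(df, p_at_critical, p_observed, reject_h0)
```

Comparing the probability with the chosen level (.05) leads to the same decision as comparing the observed χ² with the critical value in the table.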

If this were a real study, we could conclude that: 


1. The two variables are related (ethnic membership is related to choice of rewards
offered as an incentive to encourage free reading) for our sample Ss.

2. If the Ss were randomly selected and if all threats to internal and external
validity had been met in the study, we would claim that our results could be
generalized to other Ss of the same ethnic groups. Since this is not the case
here, we will use the statistical procedure to be confident that our description
of the results for our Ss is correct.

3. Even though the choice of reward was related to ethnic group membership,
we cannot say that ethnicity caused the choice of reward. Nor do we know
just how much weight to give ethnicity in terms of choice of reward. Other
variables (e.g., sex, socioeconomic status, intrinsic vs. extrinsic motivation)
might be as important or even more important in relation to choice of reward.

4. We are surprised (if we are) that the results turned out this way and we would 
like to replicate the study, taking greater care to be sure we have a represen¬ 
tative sample, that we have better operational definitions for reward types, 
and so forth. 

In practical terms, given that we cannot generalize, we might want to continue to 
give everybody whatever reward they want to select. Still, we should be willing 
to share the results with others, giving, of course, fair warning regarding lack of 
generalizability.
0000000000000000000000000000000000000


Practice 14.2 

► 1. Ss in a study were categorized as being either analytic problem solvers,
holistic problem solvers, or some mix of these two types. To see whether problem
solving style related to the acquisition of discourse skills, these Ss were given a
number of communication tasks. A composite score for each S was obtained and
Ss were then given a "grade" based on their composite score. The data classify the
Ss as follows.


                  Problem Solving Type
Grade          Holistic   Analytic   Mixed
< 25 (D)          11          9        10
25 - 49 (C)       12         11        11
50 - 74 (B)       35         34        30
> 75 (A)          32         37        28


Is performance on communication tasks related to problem solving styles? State
the H0, do a Chi-square analysis, and state your conclusions.


Can you generalize the findings? Why (not)?


2. The following table is one of many presented by Shibata (1988) in her report
of language use in three ethnic Japanese language schools. The table is based on
00 minutes of coded interactions per classroom. Three classrooms were observed
in school 1, 3 classrooms in school 2, and 4 classrooms in school 3, for a total of
10 classrooms.


Student Utterances by Language and School

             School 1        School 2        School 3
English       91    17%       69    14%      339   45.5%
Japanese     442    82%      406    85%      403   54.1%
Mixing         4     1%        4     1%        3     .4%
Total        537   100%      479   100%      745    100%


The basic question is whether the amount of Japanese and English usage differs
in the three schools. We cannot use a Chi-square analysis of these data because
we do not know whether certain children in the classes may have contributed
more (or fewer) mixed, Japanese, or English utterances. If we counted the number
of children who used Japanese, the number who used English, and the number
who used both Japanese and English, we could do a Chi-square analysis.
Explain why this is so. How much information would be lost if we did this?


0000000000000000000000000000000000000

Yates' Correction Factor 

In our examples, we have taken care that the degrees of freedom should always
be greater than 1. Yet, it is possible we might want to compare frequencies where
the independent variable has only two levels (where K - 1 = 1 df) or when we
have a 2 X 2 design (where K - 1 = 1 and K - 1 = 1 so that 1 x 1 = 1 df). In
the first case, the design is one-way and in the second case the design is two-way,
but in each the degrees of freedom is 1. With 1 df, the comparison between the
Chi-square distribution and the observed χ² value is not as close as mathematicians
would like. So, there are special correction procedures to take care of the
discrepancy. This is called the Yates' correction factor. Most computer routines
build in this correction since it compensates for cell sizes less than 5 and doesn't
negatively affect the outcome.

For one-way designs the correction factor is simple. If the observed value (O) is
greater than the expected value (E), subtract .5 from O and enter this as the
"corrected O." If the observed value (O) is less than the expected value (E), add
.5 to O and enter this as the "corrected O" value. Then continue computing the
χ² using the corrected values of O.
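The one-way correction can be sketched as below. The counts are hypothetical, the function names are our own, and the sketch assumes equal expected frequencies in the two cells. The text does not say what to do when O equals E; leaving O unchanged in that case is our assumption.

```python
def corrected_observed(observed, expected):
    """Move the observed value .5 closer to the expected value."""
    if observed > expected:
        return observed - 0.5
    if observed < expected:
        return observed + 0.5
    return observed  # assumption: no correction when O equals E


def one_way_chi_square_yates(observed_counts):
    """One-way chi-square using the corrected O values (equal expected values)."""
    expected = sum(observed_counts) / len(observed_counts)
    return sum(
        (corrected_observed(o, expected) - expected) ** 2 / expected
        for o in observed_counts
    )


counts = [30, 20]  # hypothetical frequencies for a two-cell, one-way design
chi_square = one_way_chi_square_yates(counts)
print(chi_square)
```

With these counts the expected value is 25 in each cell, the corrected observed values are 29.5 and 20.5, and the corrected χ² is smaller than the uncorrected value would be.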

Actually, researchers seldom run a one-way Chi-square with only two cells (and 
a df of 1). The reason is that one can easily look at the raw frequencies and tell
whether one cell has more than 50% of the frequencies. There is no real reason 
to run a Chi-square test and a correction procedure to determine if this is the 
case. 

If you have a 2 X 2 design (2 levels of one variable and 2 levels of the other), you
also have 1 df. A correction is required. The method by which we calculate this
correction, however, is slightly different than for a one-way design. Instead of
just adding or subtracting .5 from observed values, we have an amended
Chi-square formula. Here it is in all its glory:

$$\chi^2 = \frac{N\,(|ad - bc| - N/2)^2}{(a + b)(c + d)(a + c)(b + d)}$$

To understand all the as, bs, and cs, look at this diagram of a 2 X 2 table:

                     Var X

                 |   a   |   b   |   a + b
      Var Y      |-------|-------|
                 |   c   |   d   |   c + d

                   a + c   b + d

The letters a, b, c, and d represent the frequencies that fall into the four cells in 
the design. The two-letter symbols in the margins simply represent the sums of 
the cells in each row and column. 

To see how this works, consider the following example. To encourage language
minority students to continue their studies at the university, special tutorials were
offered to students identified early as being "at risk." At the end of their freshman
year, some "at risk" students were placed on probation and others not. The
research question is whether students who took advantage of tutorial help were
more likely to succeed (i.e., not be placed on probation). Here are the frequencies.

               +Tutorial   -Tutorial
+Probation        34           73
-Probation       432          377

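Let's plug these frequencies into our formula:

$$\chi^2 = \frac{916\,(|34 \times 377 - 432 \times 73| - 916/2)^2}{107 \times 809 \times 466 \times 450}$$

$$\chi^2 = \frac{916\,(18718 - 458)^2}{18{,}152{,}261{,}000}$$

$$\chi^2 = \frac{305{,}419{,}680{,}000}{18{,}152{,}261{,}000}$$

$$\chi^2 = 16.825$$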

The critical value of χ² for 1 df is 3.84. The value we obtained is higher, so we
can reject the H0 at the .05 probability level. We can conclude that for these
students there is a relation between tutorials and probation: fewer students who
participated in tutorials were placed on probation.
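The corrected 2 X 2 formula is easy to get wrong by hand, so a short check can be useful. The sketch below is a direct transcription of the formula, with a, b, c, and d as the four cell frequencies; the function name is our own.

```python
def yates_chi_square_2x2(a, b, c, d):
    """Chi-square with Yates' correction for a 2 x 2 table.

    a, b are the cells of the first row; c, d are the cells of the second row.
    """
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Tutorial and probation frequencies from the example.
chi_square = yates_chi_square_2x2(a=34, b=73, c=432, d=377)
critical_value = 3.84  # 1 df at the .05 level, as given in the text
print(round(chi_square, 3), chi_square > critical_value)
```

Comparing the result with the critical value for 1 df gives the same decision as the hand calculation.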


0000000000000000000000000000000000000

Practice 14.3 

► 1. To better understand student dropout rates, a questionnaire was administered
to a (stratified) random sample of 50 Ss who had dropped out of school and
50 who continued in the sophomore year. Each questionnaire item was subjected
to Chi-square analysis. Here are the frequencies for a question regarding work
hours.

            Wk < 4 hrs   Wk > 4 hrs
Drop            19           33
Continue        31           17

In the space below write out the formula and do the calculations. 


The value of χ² is_. What is the critical value needed to reject the null

hypothesis? __. Can you reject the null hypothesis?_ _

What can you conclude?_


In this case, do you think you can generalize the findings? Why (not)? If so, to
what population can you generalize?_


0000000000000000000000000000000000000

Assumptions Underlying Chi-square

As always, there are some cautions to remember before using the Chi-square 
procedure. The following assumptions must be met: 

1. The data must consist of frequencies.

The Chi-square procedure is appropriate when the data are counts of numbers
of items or people in particular classifications or cross-classifications. Do not attempt
to perform a Chi-square analysis using proportions. Change the proportions
to raw frequencies and then perform the analysis if this is appropriate.
Or, find a different procedure that is appropriate for proportions.

It is not appropriate to count the frequency of items which come from different
Ss if one S might contribute more than his or her rightful (expected) share to the
tally. For example, you cannot perform a Chi-square analysis of the relationship
of sex of S to number of times Ss volunteer answers in the classroom. One boy
might contribute a large number of volunteer turns to the data while others in
his group might contribute very few. The same might happen in terms of the
number of turns taken by the girls. Instead, count the number of Ss who take
turns and the number of Ss who do not and perform the Chi-square analysis on
these data.

M F 

+ Turns 
— Turns 


Or, count the number of turns each S takes in, say, an hour class period and use
these to set levels for the dependent variable.

No. Turns    M    F
  < 5
  5-9
  10-14
  > 15


In the above contingency table, the data that go in each cell are frequency data.
Don't be confused by the rank scales on the side of the table. These are levels
of the dependent variable set on an ordinal scale, but the data consist of
frequency counts.
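To make the distinction concrete, here is a small sketch of how such a table might be tallied. The records are hypothetical, and the treatment of a count of exactly 15 (placed in the top level) is our assumption, since the table does not settle it. The point is that each S is counted once, so the cells hold numbers of Ss rather than numbers of turns.

```python
from collections import Counter

# Hypothetical data: one (sex, number of turns in the class period) record per S.
subjects = [("M", 2), ("M", 17), ("M", 6), ("F", 0), ("F", 12), ("F", 7), ("F", 3)]


def turn_level(n_turns):
    """Assign a turn count to one level of the dependent variable."""
    if n_turns < 5:
        return "< 5"
    if n_turns <= 9:
        return "5-9"
    if n_turns <= 14:
        return "10-14"
    return "15 or more"  # assumption: exactly 15 turns belongs to the top level


# Each S is tallied once, so every cell value is a count of Ss.
cell_counts = Counter((turn_level(turns), sex) for sex, turns in subjects)

for (level, sex), count in sorted(cell_counts.items()):
    print(level, sex, count)
```

The total across all cells equals the number of Ss, which is a quick check that nobody has been counted more than once.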

As you can see, we are trying to meet the same criticism mentioned earlier for text
analysis studies. The data in such studies are traditionally sampled across a
genre. It is possible that some individual samples may contribute more heavily
than others to the frequency counts and thus give the researcher a spurious relationship
where none exists. While tradition appears to approve of this use of
the Chi-square procedure for such data, it seems inadvisable to us. Your research
advisor or consultant may also feel that a Chi-square analysis is inappropriate for
such data. (In other words, don't just accept tradition. Check with your advisors
before attempting the analysis in such cases. You don't want to put your research
at risk by making the wrong choice.)

2. The categories should form a logical classification.

It is simple to see that two levels for sex (male and female) form a logical classification.
Oral and written modes also form a logical classification. L1 membership
(Chinese, Japanese, Korean, Farsi, etc.) forms another logical classification.
However, in some studies it might be more logical to group subject-prominent
languages in one group, topic-prominent languages in another, and those which
use both into a "both S-P and T-P" group. The logic of the classification is related
to the research question.

When the research question relates to grammar categories such as clause types,
the logic of the classification may not be so easily determined. How many different
clause types form the possible "set" of clauses (as in our relative clause example)?
Why are some types included in studies and others not? Again, the
overall logic of the classification must relate to the study and be justified by the
researcher.

3. Whenever the frequency of an event is counted, the frequency of nonoccurrence
must also be counted.

This is a fairly simple matter to remember. For example, if we want to discover
the relationship between the number of Ss who pass an ESL course and L1
membership, we must count the number who pass and the number who do not
pass. If we want to look at the relationship of sex and the number of Ss who take
turns in class, we need to count the number of boys and the number of girls who
take turns and the number who do not. In the case where there are several levels
of a variable, this may not be so obvious. The frequencies in one level contrast
with all those in the other levels. In L1 membership, the number of Chinese Ss
contrasts with Ss who are not Chinese (Japanese, Korean, Farsi, etc.). The
number who are Japanese contrasts with those who are not Japanese (Chinese,
Korean, Farsi, etc.). As you can see, this reinforces assumption 2. If the category
levels are not logical and complete in terms of the research questions, it will become
impossible to meet assumption 3.

4. The data must be independent. 

There are two ways in which this assumption might be violated. First, there
might not be independence between the cells in the design. Second, there might
not be independence of data within each cell. Let's look first at the problem of
maintaining independence between the cells in the design.

Each S or observation can be entered in only one cell. That is, the Ss in our study
of reward type were tallied under only one type of reward. The same S couldn't
be entered under material and also under recognition. If a student wavers back
and forth in reward choice, we would have to add another level to reward type,
perhaps material + recognition, and place that person in this category. If you
have sent out a questionnaire and you have questions for which you want to
compare the number of people who answer "yes" and "no," you can't enter a
person in both categories. You can add the response category "undecided" and
place the person there. If you are looking at semantic functions of if clauses, you
can enter each if clause under one and only one function.
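Before running the analysis, it can help to check mechanically that no S has been entered in more than one cell. The sketch below uses hypothetical subject IDs and reward choices, and the function name is our own; it simply lists any S who appears under more than one category.

```python
# Hypothetical records: (subject ID, reward type under which the S was tallied).
records = [
    ("S01", "material"),
    ("S02", "privilege"),
    ("S03", "recognition"),
    ("S02", "material"),   # S02 has been entered under a second reward type
]


def subjects_in_several_cells(records):
    """Return the IDs of Ss who were tallied in more than one cell."""
    cells_by_subject = {}
    for subject, cell in records:
        cells_by_subject.setdefault(subject, set()).add(cell)
    return sorted(s for s, cells in cells_by_subject.items() if len(cells) > 1)


violations = subjects_in_several_cells(records)
print(violations)
```

Any S listed would need to be moved to a single category (for example, a combined level such as material + recognition) before the frequencies are analyzed.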

In addition, you can't count frequencies on a variable at one time, count them
again later and compare the frequencies using Chi-square. That is, a regular
Chi-square analysis is not appropriate for repeated-measures designs. Thus, you
couldn't count the number of ESL students who used the language lab during
their introductory course and the number of ESL students (the same students)
who used it one year later and compare these with a Chi-square analysis.

The data must also be independent within each cell. As an example, assume we
wanted to look at changing forms in pidgin languages. To do this, we collected
data from children, adolescents, adults, and elderly adults. We singled out some
area where we expect change may be found, perhaps the emerging use of one
certain modal. We expect to find the highest frequency in the data of children
with a decrease as we cross the age groups. We go through the data counting the
presence and absence of the modal (where native speakers would use the modal).
The table is divided into eight cells (four age group levels and two
presence/absence levels for the modal). Again, each piece of data goes into one
and only one cell of the box design. However, it is possible that certain Ss contributed
more than their fair share to the data within each cell. There is not independence
of data within the cell. This lack of independence threatens the
validity of the study.

As a second example, assume we ran a concordance program on the Brown corpus
searching for instances of "resemble" in hopes of validating the finding that
in sentences such as "X resembles Y" (which should be symmetric with "Y resembles
X"), the Y is always the standard of comparison. Thus "Y resembles X"
is not a synonymous form. The data base might be limited to the "science" and
"science fiction" sections of the Brown corpus. This gives us four cells (where Y
is or isn't the standard of comparison in science or science fiction text). Again,
the frequencies fall within one and only one box of the design, but there is a
problem in that certain authors within the science or science fiction base might
be fond of the word "resemble" while others might never use such constructions.
The counts within the cell are not really independent. The data are not representative
of the category (science or science fiction text) but only of certain contributors
in each category. The validity of the study is at risk if the data are
analyzed using a Chi-square analysis.

The data must be independent. If the data are not independent across the cells,
you might be able to use McNemar's test (which we will illustrate in a moment)
or CATMOD (categorical modeling), a special procedure available on computer
programs which will be discussed in chapter 16. If the data are not independent
within the cells, you might want to use solutions such as those shown on page
407. Another option, if you convert such data to rates (mean number of instances 
per 1,000 words, number of instances per 100 clauses, etc.), might be to use 
nonparametric procedures presented in chapters 9 through 12 of this manual. 

5. The sample size must be large enough to obtain an expected cell frequency of 
five. 

In cases where you have small sample sizes, some of the expected cell frequencies 
may dip below five. If this happens and your design has only 1 df, the best thing
to do is to use Fisher's Exact test. This is available in SAS, or you can read about 
it in Hays (1973), Siegel (1956), or Kendall and Stuart (1977). If your study has 
more than 1 df and some of the cells have an expected cell frequency less than 
5, it is possible that some of the cells could be collapsed without damaging your 
study. For example, if you used a separate cell for every grade from kindergarten
through 12, you would have 13 cells. You might be able to group these into K-3,
4-6, 7-9, 10-12. This would, in all likelihood, give you a large enough expected 
cell frequency to allow you to apply the Chi-square test. 
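
The check on expected cell frequencies can be sketched in a few lines of Python. The counts below are hypothetical (they come from no study in this manual) and the function name is our own; the only rule assumed is the usual one, that the expected frequency for a cell is its row total times its column total, divided by the grand total.

```python
def expected_frequencies(observed):
    """Return the expected cell frequencies for a contingency table.

    observed is a list of rows, each row a list of counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical counts: 2 groups (rows) by 4 grade bands (columns).
observed = [
    [12, 9, 4, 3],
    [10, 11, 5, 2],
]

expected = expected_frequencies(observed)
smallest = min(min(row) for row in expected)
needs_attention = smallest < 5  # some expected cells fall below 5

# Collapsing the last two grade bands into one raises the expected frequencies.
collapsed = [[row[0], row[1], row[2] + row[3]] for row in observed]
smallest_after = min(min(row) for row in expected_frequencies(collapsed))
```

With these made-up counts the smallest expected frequency is 2.5; after the last two columns are collapsed it rises to 7.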

When you do a Chi-square procedure by hand, you will have to remember to
check the expected cell frequencies yourself. If you run Chi-square on any
computer program, you should automatically receive a warning message when the
cell sizes drop below an allowable level.

Interestingly, not all statisticians agree on the minimum size allowed for
Chi-square. Some writers suggest that if the df is only 1, a minimum expected
frequency of 10 (not 5) be used, and 5 is the minimum for designs with more than
1 df. This seems rather conservative, but do not be surprised if your advisor or
statistical consultant suggests an expected cell size of 10 for a study with 1 df.
Others specify 5 as the minimum for 1 df and 2 as the minimum for larger df.
On some computer printouts you will find a warning only when 20% of the cells
have dropped below an expected cell frequency of 5. Again, be sure to discuss
this issue with your research advisor or statistical consultant prior to interpreting
your results.

6. When the number of df equals 1, apply the Yates' correction factor.

We have already discussed how to compute the correction in the previous section. 
Most computer packages automatically make this correction in carrying out the 
procedure. 
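
As a reminder of what the correction does, here is a minimal sketch. It assumes the usual form of the Yates' correction, in which 0.5 is subtracted from the absolute difference between observed and expected frequencies before squaring. The function name and the counts are hypothetical, chosen only for illustration.

```python
def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]], optionally Yates-corrected."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    total = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - e)
            if yates:
                diff = max(diff - 0.5, 0.0)  # subtract 0.5, never going below 0
            total += diff ** 2 / e
    return total


# Hypothetical counts, for illustration only.
uncorrected = chi_square_2x2(20, 10, 10, 20, yates=False)
corrected = chi_square_2x2(20, 10, 10, 20, yates=True)
```

The corrected value is always a little smaller than the uncorrected one, which makes the test more conservative for tables with 1 df.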

Given these assumptions, you should now be able to draw design boxes for
Chi-square contingency tables and decide on the appropriateness of the procedure for
data. Before doing this, though, complete the following practice. 

ooooooooooooooooooooooooooooooooooooo 

Practice 14.4 

1. Consider an example where a foreign language department surveyed student
interest in taking conversation classes in foreign languages. A Chi-square
analysis was used to relate interest of Ss (low, moderate, high) to the information they
gave on each questionnaire item. What advice would you give the department
regarding one question which relates interest in such classes to whether or not Ss
have friends who speak the language? The following contingency table (design box)
shows the choices. Cell 1, for example, would show the number of Ss who have
low interest in conversation classes and who have friends who speak the foreign
language.


Ss' Interest    + Friends spk lang.    - Friends spk lang.    Have no real friends

► 2. Draw a Chi-square contingency table (design box) to match each of the
following (fictitious) study descriptions. Feel free to set up your own levels of the
dependent and independent variable. There is no one "right" design. Compare 
your contingency tables with those given in the answer key. 

Example A

This study tests the relationship of geographic area to the importance attached
to bilingual education. In a national survey, respondents were asked how they
would vote on the issue of mandatory bilingual education in elementary schools. 




Example B 

A study related the choice of university major to the first language membership 
of foreign students. 


Example C 

An interlanguage study related the "age of arrival" (i.e., how old each S was on 
arriving in a country where the second language is spoken) to stages of negation 
(taken from Schumann's [1979] stages of no + verb, unanalyzed don't, aux + neg, 
analyzed do + neg).


Example D

The study relates the cognitive styles sometimes called "field-independent" vs.
"field-dependent" to self-correction of errors in revision of compositions.


► 3. Review each of the assumptions that must be met in order to use a
Chi-square procedure. Then, decide whether Chi-square is an appropriate test for the
data described in each of the following examples. Give the rationale for your
answer. If the procedure is not appropriate, show how the study might be
amended so a Chi-square procedure could be done.

Example A 

Students in community adult school programs were asked about their reactions 
to group-work activities scheduled in ESL classes. Students voted for, against, 
or undecided regarding continuance of these activities. The students came from 
three major L1 groups: Spanish, Far East languages, and Middle East languages.
The question is whether students from these three backgrounds have different 
views of the value of group work in language classes. 


Example B 

The school district keeps records of testing related to language disorders. Among 
many tables are those that show the number of bilingual children for whom 
stuttering is a problem only in L1, only in L2, and in both L1 and L2. The tables
also show the number of boys and girls in each category, since sex is expected to
relate to stuttering.


Example C

In chapter 5 (page 137) we presented a table showing the frequencies for seven
clause types in oral and written samples of fourth-grade children. The null
hypothesis is that there is no relationship between language mode (written vs. oral)
and frequency of clause types (seven types of clauses) in the data. 


Example D 

You taught your ESL students using two different methods: a teacher-centered
approach and a cooperative-learning approach. You counted the number of
clauses the students produced in each lesson. You hoped to show more student
participation in the cooperative-learning session than in the teacher-centered
session.


4. Look back through your own research questions and find those for which the 
Chi-square test might be appropriate. Check with your study partners to be sure 
you can meet the assumptions. If you are not sure, note your uncertainty below 
and discuss with members of your study group. 


List the number of cells in each of the studies for which you think the Chi-square 
is an appropriate procedure. Will correction factors be required?_ 


ooooooooooooooooooooooooooooooooooooo 

Interpreting χ²

As we have seen in a few of our earlier examples, there may be several variables
(independent and moderating variables) which we would like to relate to the
dependent variable. This means that the table for the observed frequencies will
have several cells.

In the example of relation of position and frequency of relative clauses we had
three cells (three different NP positions). There were 442 relative clauses in
subject position, 334 following direct objects, and 94 with prepositional phrases. In
our research article, we would report the results of our statistical test as: χ² =
218.82, df = 2, p < .05. When the χ² value allowed us to reject the null
hypothesis, we could say that the frequency of relative clauses differed in these three
positions. From the frequencies, it is obvious there were fewer relative clauses
after prepositional phrases than following subjects or objects. It is this cell which
shows the greatest difference from the expected cell frequency. In addition, the
(O - E)² ÷ E column shows that subject and object NPs have more relative
clauses than expected and that prepositional phrase NPs have fewer relative
clauses than expected.
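
The relative clause figures can be rechecked with a short Python sketch. It assumes that the expected frequency is the same for all three positions (the total divided by three), which reproduces the reported value up to rounding; the variable names are our own.

```python
observed = {"subject": 442, "direct object": 334, "prepositional phrase": 94}

total = sum(observed.values())
expected = total / len(observed)  # equal frequencies under the null hypothesis

# (O - E)^2 / E for each position
contributions = {
    position: (count - expected) ** 2 / expected
    for position, count in observed.items()
}
chi_square = sum(contributions.values())
df = len(observed) - 1
largest = max(contributions, key=contributions.get)
```

The expected frequency is 290 per position, the χ² comes to about 218.8 with 2 df, and the prepositional phrase cell makes the largest contribution.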

This is a legitimate interpretation of the χ² value for this study. The problem is
that the full table is seldom included in research articles, so it is difficult for the 
reader to know whether or not the interpretation is correct. The writer should 
explain the basis for the interpretation when the table is not included. 

The table makes it possible for us to interpret the data quite easily, but this isn't
always the case. Sometimes we want to compare the relation of several
independent variables or several levels of one independent variable to the dependent
variable. This was the case in our (fictitious) example of ethnic background and
reward preference. There were nine cells in the design. We found we could reject
the null hypothesis (that is, there was a difference in preference of reward type
according to ethnic background), but we could not say exactly where the
difference occurred by looking at the χ² value. Again, the research article might simply
report the results of the statistical test as: χ² = 42.52, df = 4, p < .05. That tells
us that the frequencies in the cells differ from those expected (given the total
number of responses and the degrees of freedom). It doesn't say where the
difference is. It might be in any of the nine cells or in some combination of them.

The best thing to do (when you need to locate precisely which cells compared with
which other cells resulted in a high χ² value) is to carry out a Ryan's procedure
(see Linton & Gallo, 1975). This procedure compares the frequencies in all the
cells against each other in ways that allow you to see exactly where the differences
are the greatest (most significant).

However, at this point what is most important is that you use the probability 
level to give you confidence in accepting or rejecting the null hypothesis. Further 
interpretation of the actual cell frequencies can be done by looking at the
(O - E)² ÷ E values for each cell (those that are largest are those which depart
most from the expected cell frequency). You can interpret your computer
printouts for χ² in exactly this way.
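
As a sketch of this cell-by-cell reading, the following Python uses the raw counts reported for Situation 5 in the Blum-Kulka and Levenston table in the Activities at the end of this chapter (native speakers: 86, 33, 51; nonnative speakers: 101, 71, 65). The function name is our own, and the code is only one way of laying out the hand calculation.

```python
def chi_square_table(observed):
    """Return (chi-square, df, per-cell (O - E)^2 / E) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    contributions = []
    for i, row in enumerate(observed):
        cells = []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            cells.append((o - e) ** 2 / e)
        contributions.append(cells)
    chi_square = sum(sum(cells) for cells in contributions)
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi_square, df, contributions


# Columns: hearer oriented, speaker oriented, impersonal.
observed = [
    [86, 33, 51],    # native speakers
    [101, 71, 65],   # nonnative speakers
]
chi_square, df, contributions = chi_square_table(observed)
```

The result agrees with the χ² of 5.908 (2 df) printed in that table. The largest contribution comes from the native speakers' speaker-oriented requests, where the observed 33 falls below the expected frequency of about 43. The direction of a difference still has to be read from O and E themselves, since the squared values are always positive.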

Once you have located the cells where the (O - E)² ÷ E values are the largest, be sure
to note whether the observed value is higher or lower than the expected value as you give
your interpretation. In the reward type study on page 401, cell 1 (row 1, column
1) and cell 7 (row 3, column 1) made the largest contribution to the χ² value.
Compared with the other two groups, the Vietnamese students selected fewer
material rewards and more privilege rewards than expected. If you wish, you
can request (from your consultant or from the computer) a Ryan's procedure
analysis to accomplish the same thing.

The Chi-square test gives us a useful way of dealing with frequency data in a
systematic way. It allows us to talk about frequencies not just in terms of
percent, proportion, or ratio, but in terms of whether or not the frequencies reflect
a relationship between variables. 

Strength of Association 

Chi-square allows us to test the null hypothesis of no relationship between
variables. When we reject the null hypothesis, we can conclude that a relationship
exists. A Ryan's procedure can show precisely where the relation differs from
that "expected." However, the test does not show how strong the relationship
between the variables in the present research might be.

When the χ² value is significant, we can estimate how strong the association
between variables is for a particular data set by using association measures. For
research with 1 df, you may use Phi (Φ). For research with more than 1 df, use
Cramer's V.


Phi (Φ)

When the χ² is significant (the H₀ can be rejected), it is possible to determine the
association between the variables for your particular data set. The formula is
very simple since the information needed is already available on your data sheet
or your computer printout. (In fact, your computer package may automatically 
generate these strength of association figures.) 



$$\Phi = \sqrt{\frac{\chi^2}{N}}$$

(χ² here is the χ² value itself, not the value squared, and N is the total number
of observations.) Let's apply the formula to
our Chi-square analysis of the relation between dropout rates and work hours
(page 405).

$$\Phi = \sqrt{\frac{\chi^2}{N}} = .26$$

The two variables (in these data) share a 26% overlap. That's a fairly strong
relationship, but it should lead us to wonder what other variables might account 
for the remaining 74%. 


Cramer's V 

Phi isn't an accurate strength of association measure with tables where the df is
greater than 1. Cramer's V should be used for these tables. You need to
calculate Phi first, and then use the following formula:

$$\text{Cramer's } V = \sqrt{\frac{\Phi^2}{\min(r - 1,\ c - 1)}}$$

Use the instructions above to calculate Phi. The min(r - 1, c - 1) means subtract 1 from
either the number of rows or the number of columns in the design, whichever is
smaller.

In the problem regarding the relation between ethnicity and reward type (page
399), there are three rows and three columns. Since they are the same, the
denominator for the problem is 2, and Φ is .58, so Cramer's V is:

$$\text{Cramer's } V = \sqrt{\frac{.58^2}{2}} = \sqrt{\frac{.3364}{2}} = \sqrt{.1682}$$

Cramer's V = .41

Again, the relationship was significant so a relation exists between the variables. 
Cramer's V shows us that not only does a relationship exist but that it is a strong
one. 
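
Both measures take only a line or two of Python. The sketch below assumes the standard forms, Phi as the square root of χ² over N and Cramer's V as the square root of Phi squared over the smaller of (rows - 1) and (columns - 1). The function names are our own.

```python
import math


def phi(chi_square, n):
    """Phi for a table with 1 df: square root of chi-square over N."""
    return math.sqrt(chi_square / n)


def cramers_v(phi_value, n_rows, n_cols):
    """Cramer's V from Phi: sqrt(Phi^2 / min(rows - 1, cols - 1))."""
    return math.sqrt(phi_value ** 2 / min(n_rows - 1, n_cols - 1))


# Ethnicity and reward type: Phi = .58 in a 3 x 3 design.
v = cramers_v(0.58, 3, 3)
```

For the ethnicity and reward type figures, `v` comes out at about .41, the value worked out by hand above.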


ooooooooooooooooooooooooooooooooooooo

Practice 14.5 

► 1. Calculate Phi (Φ) for the tutorial example on page 405. The χ² value is
16.825. _

► 2. Imagine that the χ² value for the problem solving example in practice 14.2
was 13.01 instead of .901. Then calculate Cramer's V. _


ooooooooooooooooooooooooooooooooooooo

With the Chi-square procedure, as with the procedures presented in part II, we
have used a statistical test to test our hypotheses. It is extremely important to
know which statistical test to select in testing hypotheses. The Chi-square test is
appropriate when the data are nominal and the measurement is in terms of
frequencies. The data must be independent.


McNemar's Test for Related Samples 

Here is an example where the data are not independent. A student you tutor has
great trouble with verb tense. You ask her to underline every verb in a
composition you have just returned to her. Then you ask her to correct any errors that
she sees. This makes it possible for you to do a frequency count of the number
of verbs which were incorrect in the original and which are now correct, the
number incorrect which are still incorrect, the number correct which are still
correct, and the number correct which are now incorrect. The comparison is
between the original version (time 1) and the revised version (time 2) and so the
comparison is "within" Ss, a repeated-measures design.


The table might look like this:


                           Revision

                      Correct   Incorrect

Original  Correct        4          4

          Incorrect     14          3


Look at the number in the lower left of the table. It shows that 14 items were 
incorrect in the original and are now correct. Three items were incorrect and still 
are incorrect. The top row shows four items were correct and are still correct, and 
four were correct and are now incorrect. The McNemar's analysis checks these 
changes and uses an adjusted χ² formula to give us a χ² value of 2.36 and a
two-tailed probability of .02. The largest change is in the cell which shows change 
from incorrect to correct, so we can interpret the McNemar's analysis as giving 
us confidence that this procedure of self-guided correction was effective for the 
student. While the example given here is from an individual S, the data could
have been drawn from a class of Ss at two points in time. That is, the procedure
can be applied to an individual case study or to a group. (The data here are
observations rather than Ss.)

In case you should want to do a McNemar's procedure by hand, it's a very simple
task. Since the test is for matched pairs of Ss or observations as well as two
responses from the same Ss, let's try it with a new example. Suppose that we
wondered whether maintenance of the L1 (native language) could be linked to
being a first child in a bilingual immigrant family. Each firstborn child in the
study was matched with a non-firstborn child of the same age and same sex,
enrolled in the same high school. Successful L1 maintenance was determined by
an oral interview conducted by a native speaker of each language. The data were
five ratings yielding a possible total of 25 points. A cut-off point of 18 was
established (and justified) for + maintenance.

The first step is to place the observations into a contingency table, just as we did 
for the Chi-square test: 

                              Controls

                     + Maint.   - Maint.   Total

            + Maint.     A          B      A + B
1st Child
            - Maint.     C          D      C + D

            Total      A + C      B + D    N Pairs

Here are the fictitious data. To read the table, the A cell shows the number of
pairs where the first child and the control both successfully maintained the
language. The B cell shows the number of pairs where the first child was successful
but the paired control was not. C shows the number of pairs where the first child
was not successful but the paired control was. D shows the number of pairs
where both members of the pair failed to maintain the L1.




Using table 1 in appendix C, we see that the critical value of z for .05 (two-tailed
hypothesis) is 1.96. (Remember from our earlier discussion of the z score table
in chapter 8 that 1.96 is the critical value that allows us to reject the null
hypothesis at the .05 level of significance.) We cannot reject the null hypothesis
because our z value is less than the critical value of 1.96. Our conclusion, if the
data were real, must be that L1 maintenance for these Ss is not strongly related
to being the first child in a bilingual immigrant family. The two groups do not
differ significantly. However, since the probability is .089 (z = 1.70, area beyond
z = .0446, times 2 tails = .089), we could say there is a "trend" in this direction
in the data which warrants further investigation. We might want to revise the
research so that we could see if language maintenance is related to other factors
as well.

There is one important caution to keep in mind when doing the McNemar's test
by hand. The number of changes must be ≥ 10. That is, B + C must be 10 or
more. Most computer programs automatically correct for this and use a
Chi-square distribution (with a Yates' correction factor) instead of the z distribution
to calculate probability. The z distribution is much easier to do by hand. Just
remember, though, that the number of changes (whether positive or negative)
must be at least 10 before you can estimate the probability in this way.
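
The hand calculation can be sketched in Python. The sketch assumes the by-hand formula is z = (B - C) ÷ √(B + C), where B and C are the two "change" cells, and it refuses to run when B + C falls below 10. The function names are our own. The data are the verb tense counts from the earlier example (four verbs changed from correct to incorrect, 14 from incorrect to correct).

```python
import math


def mcnemar_z(b, c):
    """z for McNemar's test from the two 'change' cells, B and C."""
    if b + c < 10:
        raise ValueError("B + C is below 10; the z approximation is not appropriate")
    return (b - c) / math.sqrt(b + c)


def two_tailed_p(z):
    """Two-tailed probability of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


z = mcnemar_z(4, 14)   # B = correct -> incorrect, C = incorrect -> correct
p = two_tailed_p(z)
```

Here z is about 2.36 in absolute value and the two-tailed probability is about .02, the figures reported for the verb tense example. Calling `two_tailed_p(1.70)` gives about .089, as in the L1 maintenance example.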

ooooooooooooooooooooooooooooooooooooo


Practice 14.6 

► 1. Your school district is considering establishing an immersion program.
Beginning in kindergarten, one class at one elementary school would be taught
completely in the "heritage language" of the community (Swedish in this case). 
Elementary school teachers in the district attend a weekend seminar on the issue, 
listening to the arguments of community leaders, immersion experts, parents, and 
administrators, and then discuss the issue themselves in small group workshops. 
Teachers cast a straw vote at the beginning of the seminar and again at the close 
of the seminar. 


                 Before

               Yes     No

After   Yes     15     10

        No      28     21

In the space below, compute the value of z using a McNemar's procedure.

$$z = \frac{B - C}{\sqrt{B + C}}$$


Can you reject the H₀? What can you conclude? _

2. Seminars often have an immediate effect on opinions but, later, opinions may
again change. To improve the study, what suggestions might you make? Would
you still be able to use a McNemar's procedure to analyze the data? Why (not)?


ooooooooooooooooooooooooooooooooooooo 

To summarize this chapter: 

1. Frequency data on separate Ss or observations = Chi-square 
analysis. 

2. Frequency data on samples from the same or matched Ss =
McNemar's test (or CATMOD in SAS). 

When computing χ² by hand, it is important to remember that two adjustments
must be made in certain circumstances. First, when the number of degrees of
freedom equals 1, an adjustment is needed: use the Yates' correction factor.
Second, if expected values for cells drop below five, either collapse some of the
cells (if this makes sense in the study) or use the Fisher's Exact test. In summary:

1. df ≥ 2, apply Chi-square.

2. df = 1, apply the Yates' correction factor. 

3. Expected cell frequency ≥ 5, apply Chi-square.

4. Expected cell frequency < 5, collapse cells or use Fisher's Exact 
test if the df is only 1. 
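
The summary can also be written as a small decision function. This is only a sketch of the rules as stated in this chapter, using 5 as the minimum expected frequency; the function name and the wording of the returned labels are our own.

```python
def choose_procedure(df, smallest_expected):
    """Suggest a procedure from the df and the smallest expected cell frequency."""
    if df < 1:
        raise ValueError("df must be at least 1")
    if smallest_expected >= 5:
        if df == 1:
            return "Chi-square with Yates' correction"
        return "Chi-square"
    if df == 1:
        return "Fisher's Exact test"
    return "collapse cells if that makes sense, then check again"


suggestion = choose_procedure(df=1, smallest_expected=3.2)
```

If your advisor prefers a stricter minimum (10 for designs with 1 df, for example), the threshold in the function would need to change accordingly.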

In interpreting the results, discuss the findings in terms of the cells where the
observed values differ most from expected cell frequencies, the (O - E)² ÷ E
values. Ryan's procedure may be used to locate more precisely the difference among
the cells.

In the following chapter, we will consider tests that allow us to discover
relationships between variables when the data are measured using rank-order scales or
interval scores. 


Activities

Read each of the following abstracts. Decide whether Chi-square might be an
appropriate statistical test for the data (or some part of the data). Give a
rationale for your decision. Draw the design box(es). Note any special corrections
that might need to be made if the data were analyzed by hand.

1. J. Holmes (1986. Functions of you know in women's and men's speech.
Language in Society, 15, 1, 1-21.) set out to investigate the assumption that you
know, often regarded as a lexical hedge, is more characteristic of women's speech
than men's. A 50,000-word spoken corpus, roughly 40% formal speech and 60%
informal speech, was examined for instances of you know. Half of the corpus was
men's speech, the other half women's. The analysis indicated that there were no
differences in the number of you knows produced by men and women, but there 
were sex differences in the functions of you know. Women used you know more 
frequently than men as a token of certainty, while men tended to use you know 
as a mark of uncertainty. 

2. K. L. Porreca (1984. Sexism in current ESL textbooks. TESOL Quarterly,
18, 4, 705-724.) did a content analysis of the 15 most widely used ESL textbooks.
Results show that men were mentioned twice as often as women in text and
pictures; that men were mentioned first in exercises, sentences, etc., three times as
often as women; and that there were six male workers appearing for every
working female. Masculine generic constructions were used extensively; the adjectives
used to describe women focused on emotions, appearance, and marriage while
renown, education, and intelligence characterized men.

3. U. Connor (1984. Recall of text: differences between first and second language
readers. TESOL Quarterly, 18, 2, 239-256.) asked 10 native English speakers
and 31 ESL learners to read a passage from the Washington Post and then to
write immediate recalls of its content. L1 subjects recalled more propositions
than ESL students, but there was no significant difference between these groups 
in recall of "high-level" ideas from the text. 

4. S. Blum-Kulka & E. Levenston (1987. Lexical-grammatical pragmatic
indicators. Studies in Second Language Acquisition, Special Issue 9, 2, 155-170.) used
Chi-square tests to compare the request strategies of native speakers and
nonnative speakers of Hebrew. Here is one table from the study that compares the
request as to whether it is hearer oriented (e.g., "Could you lend me your notes"),
speaker oriented (e.g., "Could I borrow your notes"), a combination of hearer and
speaker oriented, or impersonal. Each analysis has a 4 X 2 contingency table
(although the data are presented on one line in the table). The numbers in the
parentheses are the raw data for the study. They show the number of Ss whose
request fell into each category. The percentage figures are given directly above
these data. (If you add percentages for each NS cell, you will see that they total
100%, as do those for NNS.) The χ² value given refers to the contingency table
for that particular situation. Select one situation and, using the raw score data,
recompute the χ² value for that situation. Interpret the value in terms of the
larger (O - E)² ÷ E values.

                   Hearer          Speaker         Hear/Spkr
                   oriented        oriented        oriented        Impersonal

                   N      NN       N      NN       N      NN       N      NN
                   %      %        %      %        %      %        %      %

Situation 1
χ² = 7.669         82.7   70.8     4.9    8.2      4.3    9.0      8.0    12.0
3 df, p < .05      (134)  (165)    (8)    (19)     (7)    (21)     (13)   (28)

Situation 5
χ² = 5.908         50.6   42.6     19.4   30.0                     30.0   27.4
2 df, p < .05      (86)   (101)    (33)   (71)                     (51)   (65)

Situation 7
χ² = 14.662        21.2   29.2     23.6   28.6                     55.2   42.8
2 df, p < .03      (35)   (62)     (39)   (61)                     (91)   (89)

Situation 11
χ² = n.s.          81.7   82.3     10.4   5.0                      7.3    12.7
                   (134)  (149)    (17)   (9)                      (12)   (23)

Situation 15
χ² = 12.535        71.3   63.6     25.3   22.6     .7     .8       2.0    12.9
4 df, p < .01      (107)  (84)     (38)   (30)     (1)    (1)      (3)    (17)

Situation 1 = Student asks roommate to clean up the kitchen, which roommate
left in a mess.

Situation 5 = A student asks another student to lend lecture notes.

Situation 7 = A student asks people living on the same street for a ride home.

Situation 11 = A policeman asks a driver to move her car.

Situation 15 = A university teacher asks a student to give his lecture a week
earlier than scheduled.

5. D. Tannen (1982. Ethnic style in male-female conversation. In J. J. Gumperz
[Ed.], Language and Social Identity. New York, NY: Cambridge University
Press, 217-231.) discusses misunderstandings in male-female conversations where
one party is being more or less "direct" while the other is being rather "indirect."
The "indirect" party gives out hints which are missed while acting on hints which
were never intended; the "direct" party misses hints and is unaware that the 
partner is acting in response to perceived hints. Such misunderstandings are 
common among members of the same culture group, but mix-ups may be even 
more characteristic of cross-cultural communication. To show this, Tannen did 
a mini-experiment where she asked Americans, Greeks, and Greek-Americans to 
interpret transcripts of real conversations such as the following: 

Wife: John's having a party. Wanna go? 

Husband: OK. 

Wife: I'll call and tell him we're coming.

Then Tannen directed the Ss: "Based on this conversation only, put a check next
to the statement which you think explains what the husband really meant when
he answered 'OK.'"

1. _My wife wants to go to this party, and since she asked, I'll go to
make her happy.

2. _My wife is asking if I want to go to a party. I feel like going, so I'll
say yes.

Later, the same couple had this conversation: 

Wife: Are you sure you want to go to the party? 

Husband: OK, let's not go. I'm tired anyway. 

The Ss then put a check next to the statement which explains what the wife really 
meant by "Are you sure you want to go to the party?" 

3. _It sounds like my wife doesn't really want to go, since she's asking
me about it again. I'll say I'm tired, so we don't have to go, and she won't
feel bad about preventing me from going.

4. _Now that I think about it again, I don't really feel like going to a
party because I'm tired.

The table for choice 1 shows the following frequencies: 

    American          Greek         Greek-American

  Female   Male   Female   Male    Female   Male

    5       3       8       5        9       4

The focus of Tannen's paper is not on this mini-study but rather on the
interpretive analysis of conversations. However, for our purposes, let's consider how
we might expand the mini-study. Unfortunately we do not have the tapes of the 
actual conversations (and suprasegmentals would be very important in this 
study). Nevertheless, reproduce the questionnaire and administer it to four 
Americans (two male and two female) and four Ss from other ethnic groups (two 
male and two female). In your study group, combine the data to form your own 
contingency tables for the responses to items 1-4. If the data warrant it (i.e., your 
expected cell frequencies are large enough), do a Chi-square analysis and
interpret your findings.

6. In chapter 10 (page 306) we asked you to draw a chart showing options for 
statistical procedures. Add the procedures given in this chapter to your chart 
along with those of chapters 11 through 13.


Chapter 15
Correlation 


•Measuring relationships 
•Pearson correlation 

Computing the correlation coefficient 
Computing correlation from raw scores 
Assumptions underlying Pearson correlation 
Interpreting the correlation coefficient 
Correction for attenuation 
Factors affecting correlation 
•Correlations with noncontinuous variables 
Point biserial correlation 
Spearman Rank-order correlation (rho) 
Kendall's Coefficient of concordance 
phi: correlation of two dichotomous variables 


Measuring Relationships 

In previous chapters, we have tested differences in the means of two or more
groups with the aim of establishing the effect of the independent variable(s) on a
dependent variable. In our research, however, we often hope to establish a
relationship between the scaled or scored data of one variable with those of
another. For example, if we wonder whether a good short-term memory is related
to success in language learning, we might want to know the relationship of Ss'
scores on a short-term memory test and their scores on a language proficiency
test. Do students who score well on one exam also score well on the other, and
do students whose scores are low on one also do poorly on the other? We need
a way to measure this relationship directly (rather than as an exploratory strength
of association estimate following a t-test or ANOVA procedure).

In selecting any statistical procedure to measure relationships, we begin by asking 
several basic questions. 

1. How many variables are being compared? What are they? 

2. Is the comparison between groups or repeated-measures of the same or 
matched groups? 

3. How have the data been measured: frequencies, rank-order scales, or interval
scores?

For the above example, there are two variables being compared: short-term
memory and language proficiency. The comparison involves repeated
observations of the same Ss (repeated-measures), and the data have been measured as
scores (interval measurement).

As a second example, imagine that you and a friend team-taught two composition
classes. At the end of the course you both read all the final exam compositions,
rating them using a department guidesheet which gives 25 possible points (five
5-point scales). You wonder how strong an agreement there is between your
ratings. That is, when you give high marks to a student, does your team teacher
also give high marks? When one of you gives a student a low mark, does the
other also give a low mark? Again, you want to describe the strength of the
relationship between these two sets of ratings.

Here there are still two variables: your ratings and those of your partner.
Fortunately, correlations can be carried out where data are from the same Ss on
different measures or when ratings of performance are obtained from different
Ss. This time, though, the data are measured using an ordinal 5-point scale (and
the distribution may or may not be continuous) rather than an interval score.
Most correlations are for linear (i.e., interval) data, but special correlation
formulas are available that allow us to compare rank-ordered and even dichotomous
(nominal) data.

You may also see correlation used when the data are really open-ended frequency
data. For example, a frequency count of some particular linguistic feature might
be correlated with the frequency tally of a second feature. The number of
questions a teacher asks might be correlated with the number of questions
students produce (again, open-ended frequency tallies). In such correlations, the
researcher assumes the open-ended counts, since they are incremental in nature,
are the same as equal-interval scores. Clearly, they are not if the observations
are open-ended in length. That is, if you count instances of code-switching of
students and want to relate that to, say, vocabulary use of students, neither
variable may be continuous. There is not an equal opportunity for all Ss to
accumulate some optimal number of code-switches. If the vocabulary measure is the
number of words each S produces, again Ss have unequal opportunities to
produce words. Some researchers solve this problem by converting frequency
tallies to rates, ratios, or percentage values. For example, each S might have
X number of switches per 100 words, per turn, or whatever, and the vocabulary
measure might be turned into a type-token ratio. This conversion (as we
mentioned in chapter 5) must be justified to the satisfaction of the field. Then, the
researcher should check the data for normality of distribution to be confident that
they spread evenly throughout the range. (As you can tell, we are less happy with
converting data from frequency to ordinal to interval measurement than the
reverse! There are too many studies in our field where frequencies are converted
to percentages [e.g., /= 2 = .50], averaged for a group [X percent], and then
entered into statistical calculations as if they were interval measures.)
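
The conversions mentioned above (switches per 100 words, type-token ratio) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample counts are hypothetical and do not come from any study, and the sketch does nothing to justify treating the converted values as interval data.

```python
def rate_per_100_words(count, total_words):
    """Convert a raw frequency tally into a rate per 100 words."""
    return 100 * count / total_words


def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by number of words (tokens)."""
    return len(set(tokens)) / len(tokens)


# Hypothetical S: 6 code-switches in a 120-word sample
switch_rate = rate_per_100_words(6, 120)

# Hypothetical five-word sample with one repeated word
ttr = type_token_ratio("the cat saw the dog".split())

print(switch_rate, ttr)
```

Rates of this kind put Ss with samples of different lengths on a common scale, but the caution in the text still applies: the distribution of the converted values should be checked before they are entered into a correlation.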

This doesn't mean that we cannot use correlations to discover relations where one
variable may be a category (i.e., nominal) and the other an interval or ordinal
scaled variable. There are special correlation formulas for such data. There is
also a special correlation for two category variables. The Pearson correlation,
however, is restricted to variables where measurement is truly continuous.


Pearson Correlation 

The Pearson correlation allows us to establish the strength of relationships of
continuous variables. One of the easiest ways to visualize the relationship
between two variables is to plot the values of one variable against the values of the
other. Let's use fictitious data on short-term memory and language proficiency
as an example. Assume that data have been collected from a group of students
on each of the two variables. If we have only a few students, it will be easy for
us to plot the data. The plot is called a scatterplot or scattergram.

To show the relationship, the first step is to draw a horizontal line and a vertical
line at the left of the horizontal line. It is conventional to label the vertical axis
with the name of the dependent variable and the horizontal axis with the
independent (though these designations are arbitrary). Short-term memory will be
the independent variable in this example and proficiency the dependent variable.
This classification is fairly arbitrary, for we are not looking for the effect of an
independent variable on a dependent variable (as in t-tests or ANOVA). Rather,
we are searching for the degree of relationship between the two.

Here are the scores for the first five Ss.

S    STM Score    Prof. Score
1       25           100
2       27           130
3       28           200
4       30           160
5       35           180


The first S's proficiency score is 100 and we located that score on the horizontal
axis. The student's short-term memory score is 25 and we located that on the
vertical axis. Imagine a line straight out from each of these points; we placed
a symbol where the two lines would meet.



[Scatterplot: STM score plotted against PROFICIENCY (horizontal axis, 100 to 200)]

Each S's place is now marked on the scatterplot. It appears that as the scores
on one test increased, the scores on the other test also increased. S5, however,
looks a bit different from the others. If we look at the first four Ss, we would
expect that S5 would have scored higher on STM so that the dot would line up
more closely with those of the other Ss.

Imagine that there was a perfect relation between the two variables. Such things
do not happen in reality (not even if we measure exactly the same thing twice),
but let's imagine anyway. Assume that you and your team teacher actually gave
each student exactly the same rating on their final compositions. The assignment
of the variables to the x and y axes is really arbitrary. But, again, we want to
know the strength of agreement between the two sets of values just as we did in
the previous example. Here is the scatterplot:


[Scatterplot: Tchr 1 ratings plotted against Tchr 2 ratings]


If we draw a line through the symbols, the line is perfectly straight. There is a
perfect positive relationship between the two sets of scores. As the ratings
assigned by one teacher rise, the ratings by the other rise, so the relationship is
positive.


[Scatterplot: Tchr 1 ratings plotted against Tchr 2 ratings, with a straight line drawn through the points]

Now, imagine that you and your team teacher never agree on anything. In fact,
the higher you rate a composition, the lower she rates it. (Again, this wouldn't
happen in reality, especially since you are using a guidesheet in scoring the
compositions!) The resulting scatterplot would look like this:


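[Scatterplot: Tchr 1 ratings plotted against Tchr 2 ratings, with the points falling along a descending straight line]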

The relationship is now a perfect negative correlation. That is, increase in your 
ratings is perfectly matched by an equivalent decrease in your team teacher's 
ratings. 

Perfect correlations, whether positive or negative, are really impossible (except
when data are correlated with themselves!). That is, if you try to draw a line
through all the points (symbols) placed on a scatterplot, the result will not be a
straight line. The data do not usually line up in this way. However, it is possible
to see the general direction that a straight line would take if the two variables are
related.

Look at the following diagrams. Notice that the line does not touch all the points
but that it does reflect the general direction of the relationship. If you draw a
straight line to show the direction of the relationship, you can see whether the
correlation is positive or negative.


It is also possible that no relationship will be apparent from the scatterplot.
Imagine that you and your team teacher were asked to judge compositions of
students from another school. Here is the scatterplot for your ratings:

[Scatterplot: Tchr 1 ratings plotted against Tchr 2 ratings, with no visible pattern]

It is impossible to draw a straight line in this scatterplot to show the direction of
the relationship because there is no pattern to go by; that is, there is no real
relationship between the two ratings. You must have very different standards of
excellence for judging compositions. This shouldn't happen if raters are trained
beforehand.

Before we look at the statistical procedure that pinpoints the strength of the
relationship in correlation, look at the three scatterplots below. In the first
example, the points are most tightly clustered around the straight line. In the second, they
are moderately scattered out from the straight line, and in the third figure, the
scatter is greatest and no line is possible.



The actual strength of the correlation is reflected in these scatterplots. The
tighter the points cluster around the straight line, the stronger the relationship
between the two variables. (Remember that in a perfect correlation, the points
were right on the line.)

Of course, when you draw a scatterplot (or ask the computer to draw the
scatterplot for your data), the straight line doesn't appear. You have to imagine
it. However, there is a technical term for the imaginary straight line around
which all the points cluster. It is called the regression line, and the angle of the
regression line is called the slope. The amount of variation of the points from the
regression line and the tightness of the clustering of the points to the line
determine the magnitude, the strength, of the correlation coefficient.

430 The Research Manual


ooooooooooooooooooooooooooooooooooooo 

Practice 15.1

1. As a review, what does each point on the scatterplot represent? _


If the straight line rises, the correlation is said to be _. If it falls, the
correlation is _. If it is impossible to visualize a straight line in the
data display, the relation is so weak that no correlation can be seen.

2. Review your own research questions. For which might you want to establish
a relationship via correlation?


Draw a sample scatterplot for the correlation that you hope to obtain. If you
expect the correlation to be strong, how will the points be placed relative to a
straight line?


If you expect a correlation to exist but expect that it will not be extremely strong,
how will the points appear relative to a straight line? _


ooooooooooooooooooooooooooooooooooooo 

Computing the Correlation Coefficient 

Now that the concept of correlation has been presented, we will turn to the ways
of measuring the strength of the correlation in more precise terms. If you
remember the discussion of z scores in chapter 7, you may already have an inkling
as to how we might compare scores when they are from different data sources.

Suppose that we wanted to discover the relationship between scores on two tests.
Let's take the example where the university wonders whether they can use
information obtained from SAT verbal scores (information available with a student's
application for admission) instead of an ESL language test to determine whether
or not students should be required to take ESL classes. If the scores on the verbal
portion of the SAT are very strongly related to those on the ESL language test,
then perhaps this would be possible. The first problem, of course, is that a score
of 80 on one test does not equal a score of 80 on the other. The means for the
two tests are not the same, nor is the standard deviation the same since the tests
have different properties. However, we can equate the values of the two variables
(verbal SAT and ESL test) by converting them to z scores. (You will remember
that the z score is a way to standardize scores across tests and is computed as

$$z = \frac{X - \bar{X}}{s}$$

where $\bar{X}$ is the mean and $s$ the standard deviation of the test.)

Suppose that you asked the computer to convert each student's score on the
verbal SAT and on the ESL test to a z score. The scores are now equivalent.
(That is, a z score of 1.1 on one variable will equal a score of 1.1 on the other.)
If a student has a positive z score on one variable, we expect the same student
will have a positive z score on the second variable. If this is the case for all
students, the correlation will be positive. If students who have positive z scores on
one variable have negative z scores on the second variable, the correlation will
be negative. The closer each student's z score is on the two ratings (whether
positive or negative), the greater the strength of the relationship between the two
tests.

Here are the z scores of nine students on the two measures:

S    SAT Verbal    ESL Test    Product
1      +1.5          +1.5        2.25
2      +1.2          +1.2        1.44
3      + .8          + .8         .64
4      + .4          + .4         .16
5        .0            .0         .00
6      - .4          - .4         .16
7      - .8          - .8         .64
8      -1.2          -1.2        1.44
9      -1.5          -1.5        2.25


In the above table, S5 scored at the mean for each test since her z score is 0 on
each. The Ss with positive z scores on one measure also have positive z scores
on the second. Those with negative z scores on the first measure also have
negative z scores on the second. If you look at the z scores, you can see that each
student's score on one test exactly matches that on the other. This is another one
of those impossible perfect correlations. But how, other than comparing the z
scores, do we know this is a perfect correlation?

You will notice that there is a column labeled "product" in the above table. This
was obtained by multiplying each S's z scores. If we add the cross products in
this column, the answer is 8.98. Actually, the answer would be 9.0 if we had not
rounded off the numbers. If we take the average (divide the total by 9, since there
are 9 Ss in this table), the answer is 1. This is the value of the correlation
coefficient for these data, the value of an absolutely perfect correlation.

Here is the z score formula for the Pearson correlation. (Notice that this formula
uses N - 1 in the denominator while the previous example used N. N - 1 is an
adjustment for sample data.)

$$r_{xy} = \frac{\sum (z_x z_y)}{N - 1}$$

It is possible that you might someday have z score data and need to use this
version of the formula. However, it is more likely that you will start with raw scores.
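
The z score route to r can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names `z_scores` and `pearson_from_z` are our own, and we assume the z scores are computed with the sample standard deviation (the N - 1 version), which is what makes the N - 1 denominator give the Pearson r. The first call repeats the nine-student table, dividing by N as in the worked example; the second applies the N - 1 version to the fictitious STM and proficiency scores from the scatterplot example.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize raw scores using the sample standard deviation (N - 1)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def pearson_from_z(zx, zy, denominator):
    """Sum the cross products of paired z scores and divide by the denominator."""
    return sum(a * b for a, b in zip(zx, zy)) / denominator


# Rounded z scores from the nine-student table (identical on both tests)
sat = [1.5, 1.2, 0.8, 0.4, 0.0, -0.4, -0.8, -1.2, -1.5]
esl = list(sat)
r_table = pearson_from_z(sat, esl, denominator=len(sat))

# Raw scores for the five Ss in the scatterplot example
stm = [25, 27, 28, 30, 35]
prof = [100, 130, 200, 160, 180]
r_stm = pearson_from_z(z_scores(stm), z_scores(prof), denominator=len(stm) - 1)

print(round(r_table, 3), round(r_stm, 3))
```

Because the table's z scores were rounded, the first value falls just short of 1, as the text notes.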

The correlation coefficient, symbolized by the letter r, can be calculated in several
ways. Whatever method is chosen, the value r will always be somewhere between
-1 and 0 or 0 and +1. The closer the r is to ±1, the stronger the relationship
between the variables.


Computing Correlation from Raw Scores 

When we assign a value (a rating or a score) to a variable, we usually enter this
value as raw data on a scoring sheet (or directly into the computer). We don't
want to go to the trouble of converting each of these values into z scores. The
conventional formula for the correlation coefficient takes care of this conversion
for us. Let's work through an example using raw score data.

Imagine that you have undertaken a study that requires you to have valid
pronunciation ratings for bilingual children in their first language. To do this, your
study group advisors suggest that you ask two native speakers to listen to the
taped data and judge each child on a 15-point scale (from three 5-point scales for
segmental phonology, general prosodic features, and absence of L2 features, that
is, nativeness).

Here are the fictitious data:

S     Judge 1    Judge 2
1       12          8
2       10         12
3       11          5
4        9          8
5        8          4
6        7         13
7        7          7
8        5          3
9        4          8
10       3          5

Because the formula looks complicated, let's go through the computations in five
separate steps.

1. List the scores for each S in parallel columns on a data sheet. (Arbitrarily
designate one score as X and one as Y as we did for the scatterplots.)

2. Square each score and enter these values in the columns labeled X² and Y².

3. Multiply the scores (X × Y) and enter this value in the XY column.

4. Add the values in each column.

5. Insert the values in the formula.


The information from steps 1 through 4 is entered in the following chart.


S         X      Y      X²     Y²     XY
1        12      8     144     64     96
2        10     12     100    144    120
3        11      5     121     25     55
4         9      8      81     64     72
5         8      4      64     16     32
6         7     13      49    169     91
7         7      7      49     49     49
8         5      3      25      9     15
9         4      8      16     64     32
10        3      5       9     25     15
Totals   76     73     658    629    577


The formula for the correlation coefficient, sometimes called the Pearson
Product-moment correlation, follows. r is the symbol for the Pearson correlation
coefficient. The subscripts x and y stand for the two variables being compared
(the somewhat arbitrarily specified dependent and independent variables).

$$r_{xy} = \frac{N(\sum XY) - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$$

Since all the symbols may be a bit overwhelming, we have filled in the values 
from the data table. Check to make sure that you can match the symbols and the 
values. Then use your calculator to check the computations shown below. 

$$r_{xy} = \frac{N(\sum XY) - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$$

$$r_{xy} = \frac{(10 \times 577) - (76 \times 73)}{\sqrt{[(10 \times 658) - (76)^2][(10 \times 629) - (73)^2]}}$$

$$r_{xy} = \frac{222}{\sqrt{804 \times 961}}$$

$$r_{xy} = .25$$

The final value of r shows a positive correlation. You remember that the value
of r will always be between -1 and +1. If it isn't, you will need to recheck your
calculations.

We also said that the closer the r value is to ±1, the stronger the relationship
between the variables. This does not look like a very strong relationship. We'll
discuss just how you can interpret the strength of the correlation in a moment.
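
The five steps of the raw score method can be sketched in Python as a check on the hand calculation. This is a sketch only: the function name `pearson_raw` and the variable names are our own, and the data are the fictitious ratings of the two judges from the chart above.

```python
from math import sqrt


def pearson_raw(x, y):
    """Pearson r from raw scores, following the column totals of the worked chart."""
    n = len(x)
    sum_x = sum(x)                                 # total of the X column
    sum_y = sum(y)                                 # total of the Y column
    sum_x2 = sum(v * v for v in x)                 # total of the X squared column
    sum_y2 = sum(v * v for v in y)                 # total of the Y squared column
    sum_xy = sum(a * b for a, b in zip(x, y))      # total of the XY column
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


judge_1 = [12, 10, 11, 9, 8, 7, 7, 5, 4, 3]
judge_2 = [8, 12, 5, 8, 4, 13, 7, 3, 8, 5]

r_judges = pearson_raw(judge_1, judge_2)
print(round(r_judges, 2))
```

Printing the intermediate totals is a quick way to find an arithmetic slip if a hand-computed r falls outside the range of -1 to +1.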

ooooooooooooooooooooooooooooooooooooo 

Practice 15.2 

1. What suggestions could you give the raters that might help them reach
a closer agreement on the ratings?


► 2. Calculate the correlation between these z scores from two tests, one a writing
error detection test (where Ss locate grammar errors in paragraphs) and the other
a multiple-choice grammar test.


S      WED      Gram
1     - .40     0.00
2      0.00    - .36
3     - .80    - .73
4       .80     0.00
5      1.20     1.09
6     -1.60    -1.45
7       .80     1.45


► 3. The data in the following table are fictitious but are given to show ratings
of 10 students over five conversations on factors related to Grice's (1975) maxims
of cooperative conversation. According to Grice, a cooperative conversationalist
does not violate constraints on quality and quantity in his or her contributions to
conversational exchanges. Responses must be topic-relevant, truthful, of
appropriate quantity and appropriate quality. In the following table are scores for 10
students for five factors that reflect these constraints. For each conversation, the
S was rated on a 6-point scale for topic maintenance (Grice's maxim of "Be
relevant"), for appropriate level of informativeness of content (Grice's maxim of "Be
truthful"), for clarity (Grice's maxim to "Be clear"), and two scales for
appropriate quantity: appropriate turn-taking and appropriate turn length (Grice's
maxim of quantity: "Say enough but not too much"). A perfect score for each
factor, then, would be 30 points (6-point scale and 5 conversations) and a perfect
total test score would be 150. Do a correlation of any two variables. Use either
the z score or the raw score method, whichever you like best. (If you want to
check to make sure you really do get the same value for r with each formula, you
can do it both ways!)


S     Relev.   Inform.   Clarity   T-take   T-length
1       14       23        19        15        10
2       12       24        20        18        18
3       13       15        18        22        21
4       12       18        20        21        25
5       24       14        16        20        25
6       25       24        20        23        25
7       18        8        10        15        20
8       28        2         2        19        26
9       14       17        15        22        19
10      16       12         7        14        22


Variables correlated: _

r = _


4. Which formula do you think helps you the most to understand exactly what
r represents? Why? _


ooooooooooooooooooooooooooooooooooooo 

Assumptions Underlying Pearson Correlation 

There are four basic assumptions that must be met before applying the Pearson
correlation as a measure of how well any two variables "go together."

1. The data are measured as scores or ordinal scales that are truly continuous.

Interval and ordinal data are appropriate data for correlation. (Remember that
statisticians call these measures continuous.) However, it is important that the
ordinal measurement be truly continuous (i.e., approach equal-interval
measurement and a normal distribution).

2. The scores on the two variables, X and Y, are independent.

We know that the rating on one variable should not influence the rating on the
second. One very obvious violation of this assumption may pass unnoticed when
the Pearson correlation is used to test the relationship that exists among many
pairs of variables in the same study. For example, you might want to know how
well scores students obtain on subtests of a test battery "go together." For
instance, is the relation between vocabulary and reading strong; that is, does a high,
positive correlation exist between these two subtests? What about the correlation
between grammar and reading? Between a section which tests cultural
appropriateness of speech acts such as warnings, compliments, threats, or condolences
and a grammar subtest? Or cultural appropriateness and reading scores?
Students' subscores on all these tests can legitimately be correlated with each other.
It is also possible to correlate subscores with the total score for the battery. There
is, however, one very important thing to remember. The total test score contains
all the values of the subtest scores. The test scores on X and Y are not
independent. To solve this problem, subtract the subtest score from the total test score
before running the correlation or use other corrections available in computer
packages. When reading journal articles or reports where part-to-whole
correlations such as these are given, it is difficult, if not impossible, to be certain that
the assumption of independence of the variables has been met. Each researcher
must remember to meet the assumption, and it would be helpful if such
comparisons were reported as corrected part-to-whole correlations.
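
The correction described above, subtracting the subtest score from the total before correlating, can be sketched in Python. This is an illustration only: the helper `pearson` and the variable names are our own, and for data we reuse the fictitious conversation ratings from practice 15.2, treating the sum of the five factors as the "total test score" and the relevance factor as the "subtest."

```python
from math import sqrt


def pearson(x, y):
    """Pearson r computed from deviations around each mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# (Relev., Inform., Clarity, T-take, T-length) for the 10 Ss in practice 15.2
ratings = [
    (14, 23, 19, 15, 10), (12, 24, 20, 18, 18), (13, 15, 18, 22, 21),
    (12, 18, 20, 21, 25), (24, 14, 16, 20, 25), (25, 24, 20, 23, 25),
    (18, 8, 10, 15, 20), (28, 2, 2, 19, 26), (14, 17, 15, 22, 19),
    (16, 12, 7, 14, 22),
]

relevance = [row[0] for row in ratings]
total = [sum(row) for row in ratings]
total_minus_part = [t - part for t, part in zip(total, relevance)]

r_part_whole = pearson(relevance, total)            # subtest is contained in the total
r_corrected = pearson(relevance, total_minus_part)  # subtest removed from the total

print(round(r_part_whole, 2), round(r_corrected, 2))
```

With these fictitious ratings the uncorrected value is noticeably higher than the corrected one, which is the inflation the independence assumption is meant to guard against.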

3. The data should be normally distributed through their range.

The range of possible scores in each variable should not be restricted. For
example, with a 5-point scale, the scores should not all fall between 3 and 4 or the
correlation will be spurious. It is also important that extreme scores at the far
edges of the range for each variable not distort the findings. Check for outliers
and consider the influence of these scores in interpreting the strength of
correlation.

4. The relationship between X and Y must be linear. 

By linearity we mean that we expect that it is possible to draw an imaginary
straight line through the points on the scatterplot (and measure how tightly the
points cluster around that straight line).

Sometimes the line that would connect the points is not straight, not linear but
curvilinear. The classic example of this is the relationship between anxiety and
test scores. We all need a little anxiety to sharpen our responses. As anxiety
increases, our performance increases and so test scores should rise. However, a lot
of anxiety can have a very debilitating effect on our performance. As anxiety
increases at the upper ends of the anxiety scale, scores on tests should fall.

Here is the scatterplot for this relationship:

[Scatterplot of anxiety and test scores: the points rise and then fall]

The correlation starts out strongly positive and ends up strongly negative, and the
actual r value would, of course, be distorted.

It's also possible to have a curvilinear relationship in the other direction. For
example, in teaching an introductory course in linguistics, the notion of
comparing sound inventories and phonological rules across languages is extremely
difficult for many students. In the early stages, the more students work on the
homework assignments, the more confused they become. The relation between
time devoted to study and success on phonology quizzes may be negatively
correlated. At some point, these same students magically "get" the concept, and
from that point on, the time devoted to study is positively correlated to success
on quiz scores.

The Pearson correlation is not appropriate when the relationship is curvilinear.
The problem is that the researcher may not always be aware that a relationship
might turn out to be curvilinear. The best way to avoid this possible problem is
to ask the computer to produce a scatterplot for each correlation. If the
scatterplot shows a curvilinear relationship, either make an appointment with a
statistical consultant or read Kirk (1982) for appropriate statistical procedures for
measuring the strength of curvilinear relationships or transforming the data.
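
A small Python sketch shows why the Pearson r misleads when the relationship is curvilinear. The anxiety and test score values here are hypothetical, invented only to trace a rise followed by a fall, and `pearson` is our own helper rather than a library routine.

```python
from math import sqrt


def pearson(x, y):
    """Pearson r computed from deviations around each mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: scores rise with anxiety up to a point, then fall
anxiety = [1, 2, 3, 4, 5, 6, 7, 8, 9]
score = [9, 16, 21, 24, 25, 24, 21, 16, 9]

r_all = pearson(anxiety, score)             # whole range
r_low = pearson(anxiety[:5], score[:5])     # lower half of the anxiety scale
r_high = pearson(anxiety[4:], score[4:])    # upper half of the anxiety scale

print(round(r_all, 2), round(r_low, 2), round(r_high, 2))
```

Over the whole range the positive and negative stretches cancel, so r is close to zero even though the two variables are strongly related; only the scatterplot reveals this.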


ooooooooooooooooooooooooooooooooooooo

Practice 15.3

1. Look back at the example in practice 15.2. Do you believe that five 6-point
scales for a possible total of 30 points for each variable really meet the
requirements of the Pearson correlation? Why (not)?


2. Look at the following examples and determine whether a Pearson correlation
would be an appropriate measure for determining the strength of the relation of
the variables.

Example A: You wonder whether first language is related to the accuracy of
grammatical judgments ESL students make on such sentences as She asked me
to write the report, She made me to write the report, She advised me to be careful,
She recommended me to be careful.

Variable X: _

Variable Y: _

Rationale for (not) using Pearson correlation: _


Example B: You believe that the amount of time students spend outside class
using a language is as important as time devoted to language instruction in class.
You ask every student to keep a diary of the amount of interaction time spent
outside of class where the new language is used. You want to know how this
relates to the scores they receive on your final examination.

Variable X: _

Variable Y: _

Rationale for (not) using the Pearson correlation: _


3. If you have data already available on your computer, you might want to run
a correlation of a subtest with a total test score first without heeding the warning
about part-to-whole correlations. Then rerun the correlation, subtracting the
values for the subtest from the total test score. How much difference does this
make in the correlation? _


4. Draw the curvilinear relationship between the variables in the introductory 
linguistics course example (page 438) below: 


5. A teacher asked Ss to rate their course for "course satisfaction" using a 5-point
scale (5 showing high satisfaction). She wondered whether course satisfaction
and grades were related. She converted the grades to 1 = F, 2 = D, 3 = C, 4
= B, 5 = A and then ran a correlation for the two variables. Which assumptions
seem to be met and which do not? Would you advise using a Pearson correlation
for the analysis? Why (not)?

ooooooooooooooooooooooooooooooooooooo


Interpreting the Correlation Coefficient 

Once you have obtained the r value for the correlation, you still must interpret 
what it means, and caution is warranted. Remember that the magnitude of r 
indicates how well two sets of scores go together. The closer the value is to 1, the 
stronger the relationship between the two variables. 

Researchers use different cutoff points in deciding when a relationship is high
enough or strong enough to support their hypotheses. The most sensible way of
interpreting a correlation coefficient is to convert it into overlap between the two
measures. To compute the overlap, square the value of r. This allows us to see
how much of the variance (the variability of scores around the mean) in one
measure can be accounted for by the other. To the degree that the two measures
correlate, they share variance. The following figures represent this overlap.

If there is no correlation between the two measures, the r is 0 and the overlap of 
variance between the two measures is also 0. 


[Figure: Var X and Var Y shown as two separate boxes with no overlap; r = 0, r² = 0]


If the correlation of two measures is .60, the variance overlap between the two
measures is .60² or .36.

[Figure: Var X and Var Y partially overlapping; r = .60, r² = .36]

If the correlation is .85, the shared variance is r² or .72.

[Figure: Var X and Var Y largely overlapping; r = .85, r² = .72]


The overlap tells us that the two measures are providing similar information.
Or, the magnitude of r² indicates the amount of variance in X which is accounted
for by Y or vice versa. If there were a perfect correlation between two variables,
the overlap would be complete: the boxes would match perfectly. To the extent
that the correlation deviates from the perfect overlap of 1, we lose space shared
by the two measures. The strength of the relationship becomes weaker.

If we computed a correlation between a test of grammar and a test of general
language proficiency, and the correlation turned out to be .71, we could say that
the two tests overlap to the extent of r² (or .504). This is a fairly strong
correlation; the overlap is 50.4%. However, if you hoped to show that the two tests
measured basically the same thing, the correlation isn't very strong. You would
want an r in the high .80s or .90s. You would want a very high overlap of the
two measures.
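
The overlap figures quoted above are simply r squared, which a few lines of Python can tabulate. The function name `shared_variance` is our own label for the computation; the r values are the ones used as examples in this section.

```python
def shared_variance(r):
    """Proportion of variance the two measures share (the overlap), r squared."""
    return r ** 2


# r values discussed in this section
for r in (0.0, 0.30, 0.50, 0.60, 0.71, 0.85):
    overlap = shared_variance(r)
    print(f"r = {r:.2f}   overlap = {overlap:.3f}   ({overlap:.1%})")
```

Laying the values side by side makes the point of the section concrete: the overlap shrinks much faster than r does, so a correlation of .50 reflects only a quarter of shared variance.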

If you wanted to have confidence in the ratings that two raters assign to students'
oral reports, you would also expect that the r would be in the high .80s or .90s.
That is, you want the ratings to overlap as much as possible.

On the other hand, if you ran a correlation between a test of cultural
appropriateness of various speech act expressions and general language proficiency, the r
would probably be lower than .80 because language proficiency might have
several components, only one of which is "cultural appropriateness." Since you
would not expect that the two tests measure the same thing, a correlation of .30
to .50 (an overlap of 9% to 25%) might be expected, and therefore be an
acceptable correlation.

A correlation in the .30s or lower may appear weak, but in educational research 
such a correlation might be very important. For example, if you could show a 
correlation between early success in learning to read and later success in second 
language acquisition, this might be important if you wanted to locate prospective 
students for a FLES (foreign language in elementary school) program. While no 
causal claim could be made, in the absence of any information on other variables 
more strongly related to FLES achievement, you would do well to select early
readers for the program. 

Beginning researchers often make the mistake of asking the computer to judge the
"significance" of a Pearson correlation. Remember that judging "significance" in
this case only determines whether you can reject the null hypothesis of no
relationship between the variables (i.e., whether you can reject the "no overlap"
figure shown on page 440). A fairly weak correlation will allow you to do this.
The critical values needed to reject the null hypothesis of no correlation are given
in table 8, appendix C. Imagine that you have data from your class of 37 Ss on
the TOEFL and your final exam score. Let's see how strong a correlation you
would need in order to reject the null hypothesis.

Turning to table 8, you will see that the degrees of freedom for Pearson
correlation is the total N - 2. If there are 37 students in your class, the df is 35. In
the row for 35 df, you will find the critical value (α = .05) is .3246. If the
correlation is equal to or higher than .3246, you can reject the null hypothesis of no
relationship.
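If you do not have the table at hand, the critical r can be approximated from a critical t value. The sketch below is an illustration only: it relies on the standard relation r = t / sqrt(t² + df), and the value 2.030 is an assumption taken from a t table (two-tailed, .05 level, 35 df), not something computed here. The function name `critical_r` is hypothetical.

```python
import math

def critical_r(t_crit: float, df: int) -> float:
    """Convert a critical t value into the matching critical Pearson r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

n_students = 37
df = n_students - 2      # degrees of freedom for Pearson r: N - 2 = 35
t_crit = 2.030           # assumed: two-tailed .05 critical t for 35 df (from a t table)

print(f"df = {df}, critical r is about {critical_r(t_crit, df):.3f}")
```

The result agrees with the tabled value to within rounding, so any obtained r at or above roughly .32 would let you reject the null hypothesis for a class of this size.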

However, this procedure won't tell you how to judge the importance of the
discovered relationship. Look back at the scatterplots at the beginning of this
chapter. The figure for teacher 1 and teacher 2 on page 430 shows little relation
between the ratings. Yet, the H0 could be rejected if the correlation were
statistically significant at the .05 level: that is, there is some relationship, not much,
but some. By looking at this scatterplot, you should be able to see that a
"significant correlation" doesn't mean very much. The important thing to do is look
at the strength of the relationship. To do that, it is best to talk about the
relationship in terms of the overlap between the two measures (r²). The overlap
should be high (with an r in the .80 to 1 range) if the two measures are claiming
to test the same thing. The correlation can be much lower (with an r in the .35
to .80s range) and still be very important.

To make this superclear, let's use a silly example. Assume you wondered whether
people who are good at languages also make lots of money. You get the scores
of all the people who took the ETS tests of French, German, Russian, and
Spanish at the last administration, look up their phone numbers, and call them
all up to ask how much money they earn. And they told you! You have 1,000
Ss in this study. You run the correlation and obtain an r of .063. Heart
pounding and fingers atremble, you consult the distribution table for the Pearson
correlation. Hooray, it's significant at the .05 level! You can reject the H0 and
say that there is a relationship.

However, when you present these results at the next meeting of the MLA
Association, some smart heckler asks you how you could be so stupid. Of course
there's a relationship with that many Ss, yet the strength of association is .063²
or .00397. Only 4/1000 of the variability in income earned is accounted for by
foreign language proficiency. You sink to the floor and hope the rug will cover
you. Checking the probability of Pearson r tells you if there is a nonchance
relationship, but it is the strength of relationship that is the important part.
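The gap between "significant" and "strong" in the silly example can be checked with a few lines of Python. This is a sketch under stated assumptions: the conversion t = r·sqrt(N - 2) / sqrt(1 - r²) is the usual test statistic for a Pearson r, the cutoff of 1.96 is only an approximate two-tailed .05 critical value for a sample this large, and the function name `t_for_r` is hypothetical.

```python
import math

def t_for_r(r: float, n: int) -> float:
    """Test statistic for a Pearson r based on n pairs of scores."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.063, 1000
t = t_for_r(r, n)
r_squared = r ** 2
approx_cutoff = 1.96     # assumed large-sample two-tailed .05 cutoff

print(f"t = {t:.2f}, beyond the cutoff: {t > approx_cutoff}")
print(f"overlap (r squared) = {r_squared:.5f}")
```

The t value just clears the cutoff, so the null hypothesis can be rejected, yet the overlap stays below half of one percent.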

Pearson correlation is often used to correlate several pairs of variables at once. 
For example, having given a proficiency test, you might want to know whether 
students who do well on one section of the test do well on another section of the 
test. If there are several sections, the correlations can all be done at once and the 
results displayed in a table. It doesn't make sense to use a table if only two
variables are correlated.

Correlation tables, or matrices, usually look like triangles. That's because the 
computer prints out the correlations among all the variables so each correlation 
is printed twice. To make the table easier to read, only half of it is reproduced.

For example, here are the Pearson correlation figures for a sitting of UCLA's
ESL placement exam. The figures show the correlations among the various
subtests (dictation, cloze, listening comprehension, grammar, and reading).



             DICT    CLOZE   LISTEN     GRAM     READ

DICT        1.000     .919     .874     .925     .838
CLOZE                1.000     .826     .883     .880
LISTEN                        1.000     .854     .774
GRAM                                   1.000     .796
READ                                            1.000


The 1.000 figures on the diagonal are, of course, the correlation of one variable
with itself, so the 1.000 below the column marked DICT is of dictation with
dictation. The next column, marked CLOZE, shows first the correlation of cloze
with dictation (.919) and then cloze with cloze (1.000). The next variable,
LISTEN, shows the correlation of listening comprehension with dictation (.874), then
with cloze (.826), and then with itself (1.000). Then grammar is correlated with
dictation (.925), with cloze (.883), with listening (.854), and itself. Finally,
reading is correlated with dictation, cloze, listening, grammar, and itself. Now, if we
tried to fill in the other half of the triangle, we would repeat all this information.
For example, in the first column, we would add the correlation for dictation and
cloze. That value would be .919 and is already available as the first figure in the
cloze column. Then we would add the correlation for dictation and listening and
that is already available as the first value under the LISTEN column. If you get
a computer printout that gives all these values, use the row of 1.000 values (the
correlation of a variable with itself) as the place at which to divide the table to
avoid redundant information. Please remember that the correlations are based
on the data. Unless the observations were randomly selected and the study meets
all threats to internal and external validity, you cannot generalize. The
correlation is for these data, not some others. The correlation is also influenced by the
quality of the test as applied to these Ss, that is, test reliability. We assume that
the subtests are all equally reliable tests. If they are not, interpretation problems
arise.
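The idea of keeping only one half of a correlation matrix can be sketched in Python. Everything here is illustrative: the scores are invented (they are not the UCLA data), and the helper names `pearson` and `upper_triangle` are hypothetical. The sketch computes each pair once and leaves the redundant half blank.

```python
import math

def pearson(x, y):
    """Pearson r for two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def upper_triangle(scores):
    """Return (names, rows); cells below the diagonal are left empty."""
    names = list(scores)
    rows = []
    for i, row_name in enumerate(names):
        cells = []
        for j, col_name in enumerate(names):
            if j < i:
                cells.append("")   # already shown in an earlier row
            else:
                cells.append(f"{pearson(scores[row_name], scores[col_name]):.3f}")
        rows.append((row_name, cells))
    return names, rows

# Invented scores for five students on three subtests (illustration only).
scores = {
    "DICT": [10, 12, 15, 18, 20],
    "CLOZE": [8, 11, 13, 17, 19],
    "LISTEN": [9, 9, 14, 15, 20],
}

names, rows = upper_triangle(scores)
print(" " * 8 + "".join(f"{name:>8}" for name in names))
for row_name, cells in rows:
    print(f"{row_name:<8}" + "".join(f"{cell:>8}" for cell in cells))
```

Because the correlation of A with B equals that of B with A, nothing is lost by printing the triangle alone.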


Correction for Attenuation 

When correlating a group of variables, we assume that the method of measuring 
each of the variables is reliable. But, if each measurement is not of comparable 
reliability, the correlation itself may be distorted. For example, if we ran
correlations in a "fishing expedition" where we wanted to see the relation of many
different variables to a test of language proficiency, we might expect that all the 
data entered would not be equally reliable. It is possible that the proficiency test 
is highly reliable, the test of short-term memory not so reliable, the test of anxiety 
could have medium reliability, and several of our "home-grown" tests of listening 
comprehension, speech act recognition, and so forth might also vary in terms of 
reliability. With so much difference in reliability of measurement, the correlations 
obtained between the proficiency test and each of these other measures cannot 
be compared. 

We have not yet discussed ways of determining measurement reliability (see
chapter 18) but, if we know the reliability of our tests, we can easily hold
reliability constant and correct the correlation.

The formula for this correction is:

\[ r_{CA} = \frac{r_{xy}}{\sqrt{r_{ttx} \, r_{tty}}} \]

In this formula, r_CA is the corrected correlation, r_xy is the value of the uncorrected
correlation, r_ttx is the reliability coefficient for the first variable, and r_tty is the
reliability of the second.

Let's imagine that we ran a correlation between our home-grown test of speech 
act appropriacy, the short-term memory test, and the proficiency examination. 
The uncorrected correlation values were: speech act and proficiency = .65, 
short-term memory and proficiency = .60. However, when we look at the
reliability of the speech act test we find that it is .60 while the reliability of the
short-term memory test is .90. The reliability of the proficiency test is .95. Let's
see how this affects the correlation.

\[ r_{CA} = \frac{.65}{\sqrt{.95 \times .60}} \]

\[ r_{CA} = .86 \]



Without the correction for attenuation, the two measures (speech act appropriacy 
and short-term memory) appear to relate about equally to language proficiency.
Once the correction has been done, this picture changes in such a way that the 
relation between speech acts and proficiency is actually much stronger (for these 
fictitious data) than the relation between short-term memory and proficiency. 
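To compare the two corrected values side by side, the correction can be run for both measures. This is a minimal sketch using the fictitious figures from the example above; the function name `correct_for_attenuation` is hypothetical.

```python
import math

def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Divide the observed correlation by the square root of the product
    of the two reliability coefficients."""
    return r_xy / math.sqrt(rel_x * rel_y)

proficiency_rel = 0.95   # reliability of the proficiency test

# Speech act test: observed r = .65, reliability = .60
speech_act = correct_for_attenuation(0.65, proficiency_rel, 0.60)
# Short-term memory test: observed r = .60, reliability = .90
memory = correct_for_attenuation(0.60, proficiency_rel, 0.90)

print(f"speech act and proficiency:        {speech_act:.2f}")
print(f"short-term memory and proficiency: {memory:.2f}")
```

The less reliable speech act test gains far more from the correction (about .86) than the more reliable memory test does (about .65).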


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 15.4

► 1. At the end of the semester, ESL university students in a composition class 
were given three tests: a composition test, a multiple-choice grammar test, and 
a written error detection test (where the S must spot errors in short written
passages). The correlation of composition and multiple-choice is .70; that of
composition and written error detection is .75. The reliability of these three tests is
composition test .80, written error detection .90, and multiple-choice grammar 
test .95. The question is which of the two tests (the discrete point grammar test 
or the more global error detection test) relates better to the composition test. 
Correct the correlations (between each of these tests with the composition test) for 
attenuation. What can you conclude? 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

A final word of warning regarding the interpretation of correlation. Correlations 
measure overlap, how well two variables "go together." This is not the same thing 
as effect or causation. Correlations are never to be interpreted in the light of one 
variable affecting a second variable or causing it to change. That is, in the
example that tries to establish a relationship between hours spent using the
language outside class and success on a language achievement test, there can be no
claim that this extra outside class practice causes an increase in test scores.
(ANOVA and the t-test procedure investigate effect. If you use those procedures
for descriptive purposes and have met the threats to internal validity, you can
talk about the effect of the independent on the dependent variable for the data.
And it is possible to generalize if you have met all the threats to external validity
as well. Remember that we use eta² and omega² to determine strength of
relationship in the sample and thus decide how much importance to attribute to the
effect.)

Nevertheless, it isn't uncommon to find causal claims being made in the
literature. We are all familiar with junk mail circulars where language schools promise
to have us "speaking like a diplomat" in a new language within weeks of enrolling
in the course. If they are smart, such programs test our speaking ability prior to
any exposure to the language. Depending on how the school legally defines
"speaking like a diplomat," they can easily show a positive correlation between
time spent in the program and gains toward speaking like a diplomat. Readers 
of such ads interpret the relationship as causal. Obviously, that's a mistake. 


Factors Affecting Correlation 

Let's review the factors that can influence the value of r. First, if you have a
restricted range of scores on either of the variables, this will reduce the value of r.
For example, if you wanted to correlate age with success on an exam, and the age
range of your Ss was from 18 to 20, you have a very restricted range for one of
the variables. If a full range of scores is used, the correlation coefficient will be
more "interpretable."

A second factor that might influence the value of r is the existence of scores which 
do not "belong," that is, extreme "outliers" in the data. If you can justify
removing an outlier and doing a case study on that particular S, then it is best to do so
because that one extreme case can change the value of r. Again, you can't just 
throw out data when they do not fit. You must explain why certain responses 
are exceptional in order to justify their removal. 

A third factor which can influence the value of r is the presence of extremely high 
and extremely low scores on a variable with little in the middle. That is, if the 
data are not normally distributed throughout the range, this will throw the
correlation coefficient off. The data need to be normally distributed in the sample.
You might, however, reconstitute the data as nominal (high vs. low) and do a
different type of correlation (point biserial or phi).

A fourth factor that can affect the value of the correlation is the reliability of the
data. Statistical tests assume that data are reliable. In the case of correlation, it
is important that the reliability of measurement actually be checked. If the 
measurement of each variable is not of comparable reliability, the correlation 
must be corrected using the correction for attenuation formula. 

After all these warnings, the most important thing is to use common sense. It is 
possible to have a very high correlation coefficient (high but meaningless) or a
very low correlation that still can be important. For example, you might correlate
the scores Ss receive on their driver's license exam with their scores on a Spanish
achievement test. It's possible you might obtain a fairly high correlation, but
obviously it wouldn't mean much. Or you might get a low correlation between,
say, musical ability and scores on an achievement test in Chinese. However, with
other components of language aptitude, this might be an important and useful 
correlation that can enter into predicting success in the acquisition of Chinese 
(particularly if you want to look later at the acquisition of tone). 

Next to using common sense in interpreting correlations, the most important 
thing to remember is not to make causal claims. A correlation does not show that
success on one variable causes success on the related variable. Instead, it shows
us how strongly related the two variables are.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 15.5 

1. Interpret the meaning of the correlations presented in the ESL placement
exam table (page 443).


2. Imagine you wished to replicate Guiora et al.'s (1972) study of the relationship
between relaxation and pronunciation in a second language. The notion is that
the more relaxed and less inhibited the S becomes, the better the pronunciation.
To induce relaxation in the Ss, you gave them measured amounts of alcohol,
tape-recording their pronunciation after each additional drink. Why can't you
expect to measure the relationship of relaxation and pronunciation accuracy using
a Pearson correlation?


3. Imagine that you developed a set of special materials on appropriate
complaint procedures. This is one of the objectives of a course in communication
skills. At the end of the course, Ss are given a course final which tests all the
communication skills taught in the course. As further validation of your
materials, you ran a correlation between Ss' scores on the items for complaints and the
total final exam score. You present the correlation to your supervisor, who says
"This correlation can't be interpreted; it violates one of the basic assumptions
required for Pearson correlation!" Gee whiz, what's wrong with it?


4. After all the data from your ESL placement exam are in, you run a series of
Pearson correlations. Surprisingly, you find that the Pearson correlation between
the total test score and L1 is significant at the .001 level. Duhhh. What went
wrong here?


ooooooooooooooooooooooooooooooooooooo

Correlations with Noncontinuous Variables


While the Pearson correlation is the most common correlation in applied
linguistics research, there are times when the Pearson formula cannot be used to
measure the strength of the relationship between two variables.

When the assumptions which underlie the Pearson product-moment correlation
are violated, relationships can sometimes still be measured using other formulas.
We will present three of them here: point biserial correlation (which is used when
one variable has continuous measurement and the other is a true dichotomous
and nominal variable), Spearman's correlation (which is used when the data of
two variables have the strength of ordered ranks), and Kendall's
coefficient of concordance (a procedure which searches for relationships among
three or more noncontinuous variables).


Point Biserial Correlation 

As you have noted, the Pearson correlation is used when the data for each
variable are continuous. Point biserial correlation can be computed when one
variable is a dichotomous nominal variable and the other is interval. Point biserial
correlations are frequently used in test analysis. For example, answers to a single
test item are either right or wrong. This is a dichotomy where 1 might = wrong
and 2 might = right. Thus, it is possible to run a correlation of the performance
on a single test item with the total test score (minus that one item, of course). If
there are several subtests within the test for, say, reading, grammar, vocabulary,
listening comprehension, and so forth, it would be possible to correlate the single
test item with each subtest score. The single test item should correlate most with
the subtest in which it appears. In addition, those Ss who do well on the subtest
should pass the item and those who fail the item should have lower subtest scores. 
If not, then it is probably a poor item to use to test that particular skill. That is, 
the point biserial correlation will tell you how well single items are related to, or 
"fit" with, other items which purport to test the same thing. 

The formula for point biserial correlation is

\[ r_{pbi} = \frac{\bar{X}_p - \bar{X}_q}{s_x} \sqrt{pq} \]

In the formula, r represents correlation and the subscript pbi is for "point
biserial." The subscript letters p and q in the remainder of the formula are the two
parts of the dichotomy being compared (e.g., yes/no, pass/fail, and so forth). X̄_p
is the mean total test score for Ss who answered the item correctly. X̄_q is the mean
total test score for Ss who answered the item incorrectly. s_x is the standard
deviation of all the scores on the test, p is the proportion of Ss who got the item
right, and q is the proportion of Ss who got it wrong.

If you gave the language test to 200 students, the point biserial correlation
between this test item and the total test would be .34 if:

1. The mean for Ss answering the item right was 72.3.

2. The mean for Ss who answered the item wrong was 62.5.

3. The proportion of Ss answering the item right (p) was .42.

4. The proportion of Ss answering it wrong (q) was .58.

5. The standard deviation (s) was 14.4.

Let's put the values in the formula, and see how this works:

\[ r_{pbi} = \frac{\bar{X}_p - \bar{X}_q}{s_x} \sqrt{pq} \]

\[ r_{pbi} = \frac{72.3 - 62.5}{14.4} \sqrt{.42 \times .58} \]

\[ r_{pbi} = .34 \]
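The same arithmetic can be checked with a short Python sketch. The function name `point_biserial` is hypothetical, and q is derived as 1 - p on the assumption that every S either passed or failed the item.

```python
import math

def point_biserial(mean_right: float, mean_wrong: float,
                   sd_total: float, p_right: float) -> float:
    """Point biserial r from the two group means, the standard deviation
    of all total scores, and the proportion answering the item right."""
    q_wrong = 1.0 - p_right          # assumed: everyone either passed or failed
    return (mean_right - mean_wrong) / sd_total * math.sqrt(p_right * q_wrong)

# Values from the worked example: means 72.3 and 62.5, s = 14.4, p = .42
r_pbi = point_biserial(72.3, 62.5, 14.4, 0.42)
print(f"r_pbi = {r_pbi:.3f}")
```

The value comes to roughly .336, which falls in the range the text goes on to describe as typical of "good" items.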


Test developers typically look for items which are negatively correlated with the
subtest they should represent. These are "bad" items and should be removed if
possible. Correlations in the .20 to .40 range are typical of "good" test items
because they show test items where Ss with high test scores pass the item and those
with low scores fail the item.

One very interesting example is that presented by Hudson and Lynch (1984)
where point biserial correlations are given for test items with subtest scores. Two
different sets of correlations are given. The first is for Ss who have had
instruction on the objectives tested by the items and the second is a comparable class just
prior to instruction. Here are the correlations for five listening test items.

Listening Item Point Biserial Correlations

Item    + Instruct.    - Instruct.

 1          .23            .29
 2          .14            .15
 3          .06            .31
 4         -.60            .19
 5          .09            .02


We expect Ss who have been instructed to show a small range in scores and lower
point biserial correlations. The point biserial correlation values will be higher for
the uninstructed group because Ss in the uninstructed group have greater
variance. The reason this example is especially interesting here is that the correlations
help to show which items most discriminate between high and low scorers on the
test. For example, some of the items show a high correlation for uninstructed (for
example, item 3 is .31; the item is good for this group, they don't all already know
it) and a lower correlation for the instructed group (.06). The item is weaker for
this group; there is little variation in their responses, they "know" it (which can
be taken to reflect successful instruction since there was ample variation for the
uninstructed group). Thus, point biserial correlation can be useful in examining
criterion-referenced tests and criterion-based instruction.


Point biserial correlations are occasionally used for purposes other than test
evaluation in applied linguistics. We've already noted that a Pearson correlation
is spurious when the data are not well distributed in the range. In cases where
one variable is continuous but the second consists of scores at the top and bottom
of the distribution with few scores in the center, the second variable could be
changed to a "high/low" dichotomous variable. A point biserial correlation would
then be appropriate. Interpretation of the strength of the correlation would, of
course, depend on the purpose of the correlation. For more information, see
Guilford and Fruchter (1978).


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 15.6 

1. Can you think of an example from questionnaire research in which the way
respondents answered one question might be correlated with responses on parts
of the questionnaire or on the total? (If so, you've thought of a new application 
for point biserial correlation!) 


► 2. Calculate the point biserial correlation for the following data. The mean for
Ss who answered a test item correctly was X̄ = 80. The mean for Ss who
answered the item incorrectly was X̄ = 70. The proportion of Ss who answered the
item right was .60. The proportion of Ss who answered it incorrectly was .40.
The s was 15. Is this a "good" item for the test? Why (not)?

\[ r_{pbi} = \frac{\bar{X}_p - \bar{X}_q}{s_x} \sqrt{pq} \]

r_pbi =


ooooooooooooooooooooooooooooooooooooo

Spearman Rank-Order Correlation (Rho)

Most statisticians argue that the Pearson correlation should not be used unless
the data are normally distributed (for example, when we have a large sample size
and the data are spread throughout the range). When the distribution is not
normal, they argue for Spearman's correlation. This might lead you to believe
that the Spearman correlation would be used much more frequently than the
Pearson. However, this is not the case.

The Spearman correlation is appropriate for both rank-order data and interval
data with the strength of ranks. When the computer computes the value of ρ, it
automatically changes the interval data to rank-order data (and thus loses some
precision of information in the data). When you do the calculation by hand and
have determined that interval data are not normally distributed but have the
strength of ranks, you will arrange the scores in an order and assign them ranks.
Rho (ρ) tells us how the rankings of data from two variables are related.
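Changing interval scores to ranks can be sketched in Python as follows. The scores are invented for illustration, the function name `to_ranks` is hypothetical, and giving tied scores the mean of the ranks they occupy is an assumption (a common convention, though not one the text prescribes).

```python
def to_ranks(scores, highest_first=True):
    """Assign rank 1 to the best score; tied scores share the mean of
    the rank positions they occupy (an assumed convention)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=highest_first)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2     # mean of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Invented interval scores for five Ss.
print(to_ranks([88, 75, 92, 75, 60]))
```

Notice what is lost in the conversion: the 4-point gap between 92 and 88 and the 13-point gap between 88 and 75 both become a difference in rank and nothing more.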

In our field, Spearman rho has often been used to look at the order of morpheme 
acquisition by second language learners. If you collect test data (perhaps using 
Fathman's [1975] SLOPE test or the Bilingual Syntax Measure [1973]) on a
group of second language learners, you can score each person's use of, say, the
morphemes that have often been used in such studies. This would allow you to 
rank-order the morphemes for your group and then compare the obtained order 
with that shown in the literature for other second language learners. The ranking 
of the morphemes in your data would be compared with the ranking in previous 
studies. 

Another possibility would be to establish a scale (using the Guttman
implicational scaling procedure) for the grammar points taught in, say, a series
for teaching beginning German. (This might, or might not, agree with the
sequence in the materials.) The data that let you establish the scale might come
from test items for each grammar point obtained from all the students in
first-year German classes. These data are cross-sectional. To check whether this scale
also reflects an acquisition order, you might then have a project where you
tape-record at regular intervals a learner who is not receiving instruction in
German but who is living in a German-speaking environment. You might set up
a matrix of the grammar points and check to see at what point each grammar
item appears with some set degree of accuracy in the tape-recorded data. Again,
this would allow you to determine a rank-order of the grammar points. The order
for the acquisition data could then be correlated with the scale from the
cross-sectional accuracy data.

It is also possible that you might make a case study of two learners and chart the 
accuracy of their use of a set of forms at one time. Each grammar point would 
be rank-ordered for each S and the two orders related using a Spearman ρ.

As always, it will be much easier to compute the value of ρ if the data are first
displayed on a data sheet. Let's imagine that you have the classroom data on the
German test and that it has turned out to be scalable using a Guttman analysis.
In a case study of one learner, you have charted a "percent correct in obligatory
instances" for each item over time. This has allowed you to rank-order the data.

The first step is to use the Guttman analysis order and, then, place the case study 
order next to it. 

Gr. Point    Guttman    Case Data

    9           1           3
   10           2           1
    8           3           2
    4           4           8
    3           5           7
    1           6           4
    2           7           6
    6           8           5
    7           9          10
    5          10           9

The numbers in the first column are the grammar points (actually the I.D.
number represents the order in which these items are sequenced in the book, but that
is not a concern in this particular study). The number in the second column
orders these points according to how accurately Ss in the German class used them
on a test. It is rank-ordered. The third column gives the order for the same
grammar points in the acquisition data. So, for example, the grammar point
presented first in the series was sixth on the Guttman scale in terms of difficulty
and was acquired fourth overall by the case study learner. The grammar point
presented ninth in the series was the easiest for the students in the class and was
the third item acquired by the case study learner.

Once the two orders are displayed, the next step is to compute the differences in 
the rank-orders. We have placed this in a new column, labeled d for difference, 
and squared that value. (We squared the obtained values to get rid of negative 
numbers.) 


Gr. Point    Guttman    Case Data     d     d²

    9           1           3        -2      4
   10           2           1         1      1
    8           3           2         1      1
    4           4           8        -4     16
    3           5           7        -2      4
    1           6           4         2      4
    2           7           6         1      1
    6           8           5         3      9
    7           9          10        -1      1
    5          10           9         1      1


The sum of the squared differences (Σd²) for the above data is 42. With the
information displayed in a table, it is an easy matter to place the values in the
formula for rho (ρ). The number 6 in the formula is a constant. It does not come
from the data. All the other values, of course, do.

\[ \rho = 1 - \frac{6(\sum d^2)}{N(N^2 - 1)} \]

The N for the study is 10 because the number of grammar points ranked is 10. 
(Notice that it is not 20, the number of items ranked in the class data plus the
number in the case data.) Let's place these values in the formula and compute
ρ (rho).

\[ \rho = 1 - \frac{6(\sum d^2)}{N(N^2 - 1)} \]

\[ \rho = 1 - \frac{6(42)}{10(10^2 - 1)} \]

\[ \rho = .745 \]
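The hand calculation can be mirrored in a few lines of Python. This is only a sketch: the function name `spearman_rho` is hypothetical, the two lists are the Guttman and case study orders from the data sheet above, and the formula as written assumes there are no tied ranks.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rho from two lists of ranks (no ties assumed)."""
    n = len(ranks_a)                       # number of items ranked, not 2n
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

guttman = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
case_data = [3, 1, 2, 8, 7, 4, 6, 5, 10, 9]

sum_d2 = sum((a - b) ** 2 for a, b in zip(guttman, case_data))
rho = spearman_rho(guttman, case_data)
print(f"sum of squared differences = {sum_d2}")
print(f"rho = {rho:.4f}")
```

The sum of squared differences is 42, and rho comes to about .745.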


To interpret the obtained value of rho, once again we talk about the magnitude 
of rho and the direction of the relationship. The value, as with r, will always be 
somewhere between —1 and 0, or between 0 and +1. You can have a negative 
or a positive correlation (just as with r). The closer the obtained value of ρ is to
±1.0, the stronger the relationship between the two orders.

Unfortunately, we cannot talk about the strength of rho in terms of ρ². That is,
we cannot square the value we obtain and talk about the overlap between the two 
orders. This is because we have already squared differences in computing the 
value of rho. 

Instead, we can check for the significance of ρ using the correlation table, table
9 of appendix C. While probability levels obtained for Spearman are more often
used than they are with Pearson, again it makes more sense to talk about the
strength of the correlation (how close to ±1 it is) rather than the fact that we feel
confident in rejecting the H0 of no relationship between the two ranks. The
probability level does, however, let us say that the two orders are or are not
related. For this problem, the critical value of rho (with an N of 10) is .648 for a
probability cutoff point of .05. We can, therefore, say the variables are related.
The strength of the correlation, however, tells how well correlated they are. For
the example above, we would conclude that, despite the obvious difference in
actual ranks between the accuracy and acquisition data, they are quite closely
related. Certainly we would feel that some reordering of presentation in teaching
would be worth trying on an experimental basis. 

While the Spearman rank-order correlation is used in many studies in our field,
there are some problems with it. Researchers sometimes opt for another
rank-order correlation, Kendall's tau. Kendall's tau is better able to handle the
problem of tied ranks. When there are a number of ties in the ranks, Spearman is not
a good procedure to use. Computer packages adjust Spearman correlations for
ties, but hand computations are rather cumbersome so we haven't presented or
discussed the formula here. If you have many ties, use a Kendall tau procedure
instead. (See Daniel [1978] for the formula to compute a Kendall's tau.)


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 15.7

► 1. Compute Spearman's rho for the following data. Students in a teacher
training program were required to take 12 different classes to obtain their
credentials. After graduation, students from two succeeding years were asked to
judge the "helpfulness" of each class as preparation for their teaching tasks. The 
data rank the classes as follows: 


Class    Gp1 rank    Gp2 rank

  1          1           3
  2          2           9
  3          3           5
  4          4           1
  5          5           2
  6          6          10
  7          7           4
  8          8           8
  9          9          12
 10         10           6
 11         11           7
 12         12          11


How well do the two groups of graduates agree in their ratings of the usefulness 
of the classes? 


ooooooooooooooooooooooooooooooooooooo

Kendall's Coefficient of Concordance 

In the Spearman correlation we are concerned with how well data on two
variables correlate. However, we cannot use the Spearman to look for a relationship
among more than two variables. Kendall's W allows us to do this, the only
stipulation being that the data are at least ordinal. That is, the data must have the
strength of ranks.

The following fictitious data set shows us the ranking by six "good language
learners" of 10 factors which are assumed to relate to success in second language
learning. As gifted language learners, they were asked to give a rating of 0 to 10
for each factor. The factors were then converted to ranks so that for each learner
the 10 factors are now rank-ordered. The 10 factors are: intrinsic motivation,
extrinsic motivation, personal need achievement, field independence, integrative
motivation, short-term memory, tolerance for ambiguity, risk-taking,
extroversion, and convergent thought processes.

To set up a table for the Kendall Concordance, remember that the people who
are doing the ranking go down the side of the table, and the items (or, in some
cases, people) being rank-ordered go across the top of the table. The procedure
will compare the ranks given by each S (the data shown in each row) across all
Ss. Your actual raw data may first consist of ratings or scores but they must be
converted to ranks.


S       A    B    C    D    E    F    G    H    I    J   Total

S1      4    6    1    2    8   10    9    3    5    7     55
S2      5    2    8    6    1    3    7    4    9   10     55
S3      7    1    9    5    2    4    6    3    8   10     55
S4      6    5    2   10    8    3    4    1    7    9     55
S5      5    7    2    1    9    8   10    4    6    3     55
S6      1    4    9    7    5    3    2    8   10    6     55

Tot    28   25   31   31   33   31   38   23   45   45    330


A = intrinsic motivation, B = extrinsic motivation, C = personal need achievement,
D = field independence, E = integrative motivation, F = short-term memory,
G = tolerance for ambiguity, H = risk-taking, I = extroversion, J = convergent
thought processes.

The H0 would be that there is no relationship among the factors. If this is correct,
then we would expect that the distribution of the ranks would be random. Since
we have 10 factors, we would expect that each would receive a total of
330 ÷ 10 = 33 if there were no relationship. The totals at the bottom of the table
vary from 23 to 45.

The formula for Kendall's W is algebraically derived, so like most nonparametric
procedures, it's not easy to see what's going on. The formula is based on a
procedure where the difference of the total ranks for each factor from 33 is found
and squared and then totaled (sounds like an ANOVA sum of squares method,
right?). The amount that these totals differ from 33 gives us a measure of
association among the 10 factors.

$$(28 - 33)^2 + (25 - 33)^2 + (31 - 33)^2 + \cdots + (45 - 33)^2 = 514$$

Second, we need a measure of agreement in the ranks assigned to each factor.
If there were a perfect association, the best factor (the one ranked highest by
everyone) would be ranked 1 by all learners and thus have a total of 6. The second
best should be ranked 2 by all learners for a total of 12. The third best should
have 3 × 6 = 18 points and the worst should have 10 × 6 = 60 points. Look at
the total points for each factor. They are not close at all to this "perfect" ranking.

If there were a perfect correlation, then we could subtract the expected total rank
for each factor (33) from the perfect rank scores of each factor, square these and
total the squared differences:

$$(6 - 33)^2 + (12 - 33)^2 + (18 - 33)^2 + \cdots + (60 - 33)^2 = 2970$$


The ratio of these two measures will always be somewhere between 0 and 1. If
it isn't, recheck your calculations. The correlation in our case is quite low:

$$\frac{514}{2970} = 0.173$$
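
The reasoning so far can be checked with a few lines of code. What follows is only a sketch in Python (the language and the variable names are our own choice, not part of the procedure): it totals the ranks for each factor, compares each total with the expected value of 33, and forms the ratio of the observed sum of squared differences to the sum we would get under perfect agreement.

```python
# Ranks given by the six learners (rows S1-S6) to the ten factors (columns A-J).
ranks = [
    [4, 6, 1, 2, 8, 10, 9, 3, 5, 7],
    [5, 2, 8, 6, 1, 3, 7, 4, 9, 10],
    [7, 1, 9, 5, 2, 4, 6, 3, 8, 10],
    [6, 5, 2, 10, 8, 3, 4, 1, 7, 9],
    [5, 7, 2, 1, 9, 8, 10, 4, 6, 3],
    [1, 4, 9, 7, 5, 3, 2, 8, 10, 6],
]
m = len(ranks)       # number of learners doing the ranking (6)
n = len(ranks[0])    # number of factors being ranked (10)

# Column totals: the "Tot" row of the table.
totals = [sum(row[j] for row in ranks) for j in range(n)]
expected = sum(totals) / n                      # 330 / 10 = 33

# Observed squared differences from the expected total.
observed_ss = sum((t - expected) ** 2 for t in totals)

# Totals under perfect agreement: 6, 12, 18, ..., 60.
perfect_totals = [m * rank for rank in range(1, n + 1)]
perfect_ss = sum((t - expected) ** 2 for t in perfect_totals)

ratio = observed_ss / perfect_ss
print(totals)
print(observed_ss, perfect_ss, round(ratio, 3))   # 514.0 2970.0 0.173
```

Replacing `ranks` with your own table (one row per ranker) gives the same ratio for any set of untied rankings.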


The formula for Kendall's W, however, is not so transparent since
mathematicians have done what is called "appropriate algebraic manipulation" to make it
more precise. The formula is (gasp?):

$$W = \frac{12 \sum_{j=1}^{n} R_j^2 - 3m^2 n(n + 1)^2}{m^2 n(n^2 - 1)}$$

Don't worry about the formula. It will be clear in a minute. The easiest parts
are the 12 and 3 because they are constants in the formula. That is, they are part
of the formula; they don't come from the data.


Step 1

Find the value of $\sum_{j=1}^{n} R_j^2$

To do this, we simply square the total for each factor (the "total" row at the
bottom of the data table) and sum these: (28² + 25² + ... + 45²) = 11404.


Step 2

Multiply the result of step 1 by 12 (the constant in the formula). The answer for
this data is 136848.


Step 3

Subtract $3m^2 n(n + 1)^2$

So, we multiply 3 by m². m is the number of learners doing the rankings. We
have 6 learners. n is the number of factors (or objects or Ss) being ranked, and
that's 10. So 3 × 6² × 10(10 + 1)² = 130680.

Step 4

Find the value of $m^2 n(n^2 - 1)$. The result is 35640.

Step 5

Enter the values from steps 1 through 4 in the formula and complete the
calculation.

$$W = \frac{136848 - 130680}{35640} = \frac{6168}{35640} = .173$$

As you can see, the actual Kendall's Concordance figure matches that given in
our explanation even though "algebraic manipulation" has been used to correct
the basic formula.
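
The five steps can also be written as a short function. This is a minimal sketch in Python, assuming untied ranks; the function name `kendall_w` and its arguments are our own labels. It takes the column totals from the bottom row of the table and the number of rankers m.

```python
def kendall_w(totals, m):
    """Kendall's W from the rank totals of n items ranked by m rankers (no ties)."""
    n = len(totals)
    sum_of_squares = sum(t ** 2 for t in totals)      # step 1: sum of squared totals
    step2 = 12 * sum_of_squares                       # step 2: multiply by 12
    step3 = 3 * m ** 2 * n * (n + 1) ** 2             # step 3: 3 m^2 n (n + 1)^2
    step4 = m ** 2 * n * (n ** 2 - 1)                 # step 4: m^2 n (n^2 - 1)
    return (step2 - step3) / step4                    # step 5: complete the formula

totals = [28, 25, 31, 31, 33, 31, 38, 23, 45, 45]     # "Tot" row of the table
w = kendall_w(totals, m=6)
print(round(w, 3))   # 0.173
```

Keeping the steps as separate variables makes it easy to compare each intermediate value (136848, 130680, 35640) with a hand calculation.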

To interpret the coefficient, we must convert the value of W into a χ² statistic by
using the following formula:

$$\chi^2 = m(n - 1)W$$

Since m in this example is 6, n is 10, and the obtained value for W was .173, the
χ² value is:

$$\chi^2 = 6 \times 9 \times .173 = 9.34$$

Checking this value of χ² in the Chi-square table (table 7 of appendix C), we see
that we need to know the df to read the table. Our df is n - 1 or 10 - 1 = 9.
The χ² critical value needed for 9 df is 16.92. So we cannot reject the H0. We
have to conclude that there is no significant relationship (agreement or
disagreement) in how these learners view the importance of these factors.
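
The conversion and the decision can be sketched as follows. The function name is hypothetical, and the critical value 16.92 is simply copied from the Chi-square table cited above rather than computed.

```python
def w_to_chi_square(w, m, n):
    """Convert Kendall's W to a chi-square value; df is n - 1."""
    chi_square = m * (n - 1) * w
    df = n - 1
    return chi_square, df

chi_square, df = w_to_chi_square(0.173, m=6, n=10)
critical_value = 16.92          # .05 level, 9 df, read from the Chi-square table
reject_h0 = chi_square >= critical_value

print(round(chi_square, 2), df, reject_h0)   # 9.34 9 False
```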


ooooooooooooooooooooooooooooooooooooo 

Practice 15.8 

► 1. At the conclusion of instruction, your language department asks students
to rank-order the usefulness of various classroom activities. You want to know
whether there is general consensus among the students as to what is useful and
what is not. There are 10 activities: (a) composition, (b) composition revisions,
(c) listening comprehension activities, (d) vocabulary exercises, (e) grammar
exercises, (f) examinations, (g) problem solving activities, (h) diary writing, (i)
watching target language soap opera, (j) activities based on soap opera. Here are
the fictitious data:

S      a    b    c    d    e    f    g    h    i    j

1      9   10    6    1    7    8    2    4    3    5
2     10    9    7    2    6    8    3    5    4    1
3      9    8    6    1    5   10    4    3    2    7
4      4    9    6    8   10    3    7    2    5    1
5     10    9    6    1    7    8    3    5    2    4
6      9   10    6    1    7    8    3    5    2    4
7      6    7    1    4    9    2   10    5    8    3
8     10    9    7    2    5    8    3    6    4    1
9      9   10    6    1    8    7    3    5    2    4
10     8    6    2    5   10    9    7    4    3    1
11     9    8    6    1    5   10    4    3    2    7
12     9   10    6    1    7    8    4    5    3    2
13     1    6    7   10    9    8    2    4    3    5
14     9   10    6    1    8    7    3    5    2    4
15     6    3   10    8    1    2    5    4    7    9

Tot  118  124   88   47  104  106   63   64   52   59


The H0 is that there is no agreement among the Ss as to the relative merit of the
different activities used in class.

Compute W.

$$W = \frac{12 \sum_{j=1}^{n} R_j^2 - 3m^2 n(n + 1)^2}{m^2 n(n^2 - 1)}$$

W = ____

► 2. The coefficient of concordance is higher this time than in our previous
example. To interpret the values, though, we must check the critical values for an
.05 level (or less, depending on the level you select for rejecting the H0). So, once
again, convert the value to a χ² statistic:

χ² = m(n - 1)W = ____

Check the χ² critical value for n - 1 = 9 df in the Chi-square table in the Appendix to see if
we can reject the H0. Can we reject the H0? Why (not)? What can you conclude?


ooooooooooooooooooooooooooooooooooooo 

As always, the calculations are tedious when we must do the work by hand. The
problem is compounded when we have ties. When two or more observations are
equal, we assign each the mean of the rank positions for which it is tied. The
problem is that the formula must then be adjusted to take care of the ties. To
do this we change the denominator of the formula to:

$$m^2 n(n^2 - 1) - m \sum (t^3 - t)$$

t stands for the number of observations in any set of rankings that is tied for a
given rank.
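
For readers who would rather let a program handle the ties, here is a hedged sketch in Python. It assumes the usual tie correction for W, in which m Σ(t³ - t) is subtracted from the denominator m²n(n² - 1), with the sum taken over every group of tied ranks in every ranker's row. The function name and the small data set are hypothetical, made up only to show the mechanics.

```python
from collections import Counter

def kendall_w_ties(ranks):
    """Kendall's W with the tie-corrected denominator.

    ranks: one row per ranker; tied items carry the mean of the tied positions.
    """
    m = len(ranks)
    n = len(ranks[0])
    totals = [sum(row[j] for row in ranks) for j in range(n)]
    numerator = 12 * sum(t ** 2 for t in totals) - 3 * m ** 2 * n * (n + 1) ** 2

    # Sum t^3 - t over each group of tied ranks within each ranker's row.
    # An untied rank has t = 1 and contributes 1 - 1 = 0.
    tie_sum = 0
    for row in ranks:
        for t in Counter(row).values():
            tie_sum += t ** 3 - t

    denominator = m ** 2 * n * (n ** 2 - 1) - m * tie_sum
    return numerator / denominator

# Hypothetical example: three rankers who agree completely, each tying the
# first two items (ranks 1 and 2 are both recorded as 1.5).
tied_ranks = [[1.5, 1.5, 3, 4]] * 3
print(kendall_w_ties(tied_ranks))   # 1.0
```

When there are no ties the correction term is zero, so the function gives the same value as the uncorrected formula.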


phi: Correlation of Two Dichotomous Variables 

It's true that when we work with nominal variables, our first thought is to turn
to a Chi-square analysis. However, in language testing the phi correlation is
another possibility. It looks at association in two dichotomous variables. That is,
each variable is a dichotomy measured as yes/no, present/absent,
correct/incorrect, pass/fail, etc.

In language testing, the relation of two test items might be assessed according to
how many people failed both items, passed both items, or got item A correct and
item B wrong or item A wrong and item B correct. It's possible to use the
procedure, however, in other areas of language research as well. For example, we
could test the relationship between early (kindergarten to grade 3) vs. late (grade
4 to grade 6) instruction in FLES (foreign language in elementary schools) and
subsequent study of foreign languages (yes/no) in high school. Or, we might
correlate Korean surname (yes/no) of Americans applying for a university's study
abroad program to Korea or other countries (again, yes/no in respect to
requesting Korea). The table for each of these would look something like this:


            no     yes

no           A      B

yes          C      D


The formula for phi is:

$$r_{phi} = \frac{BC - AD}{\sqrt{(A + B)(C + D)(A + C)(B + D)}}$$

Let's add data to our table and calculate an example phi coefficient. As an
example, you know that most people who apply to applied linguistics doctoral
programs send applications to several schools. Imagine that university A (our very
own) wants to know whether the result of its selection process correlates with that
of university B (our cross-state rival institution). Here are the figures:


            Reject   Accept

Reject         6       16

Accept         8        6


$$r_{phi} = \frac{BC - AD}{\sqrt{(A + B)(C + D)(A + C)(B + D)}}$$

$$r_{phi} = \frac{128 - 36}{\sqrt{(22)(14)(14)(22)}}$$

$$r_{phi} = .299$$

The value of the phi coefficient is converted to a χ² value by placing it in the
equation:

$$\chi^2 = N\phi^2$$

For our data, this would be:

$$\chi^2 = 36(.299^2)$$

$$\chi^2 = 3.21$$

We can then use our regular Chi-square table in appendix C to find the critical
value needed to reject the null hypothesis of no relationship between the two
schools' acceptance/rejection results. The χ² critical value needed is 3.84. The
correlation is, indeed, weak.
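
The phi calculation and its conversion to χ² can be sketched in Python as below. The cell values A = 6, B = 16, C = 8, D = 6 are the ones implied by the worked figures (BC = 128, AD = 36, marginal totals of 22 and 14, N = 36); the function name is our own.

```python
import math

def phi_from_cells(a, b, c, d):
    """Phi for a 2 x 2 table with cells A, B (top row) and C, D (bottom row),
    using the form given in the text: (BC - AD) / sqrt((A+B)(C+D)(A+C)(B+D))."""
    return (b * c - a * d) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

a, b, c, d = 6, 16, 8, 6
n_total = a + b + c + d                 # N = 36 applicants
phi = phi_from_cells(a, b, c, d)
chi_square = n_total * phi ** 2         # chi-square = N * phi squared

critical_value = 3.84                   # .05 level, 1 df, from the Chi-square table
print(round(phi, 3), round(chi_square, 2), chi_square >= critical_value)
# 0.299 3.21 False
```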

There is a second formula that some researchers find easier to use. It is:

$$r_{phi} = \frac{p_{ik} - p_i p_k}{\sqrt{p_i q_i p_k q_k}}$$

The subscripts i and k refer to the two parts of the dichotomy being tested (e.g.,
yes vs. no, pass vs. fail, and so forth). In our example, i would be acceptance at
university A and k would be acceptance at university B. In the formula, p_ik = the
proportion of Ss getting both "correct," p_i = the proportion getting i correct, p_k
= proportion getting k correct, q_i = proportion getting i wrong, and q_k =
proportion getting k wrong. Again, once the value of φ is found, it is placed in the
χ² formula, and the result is checked against the critical value in the Chi-square
table to determine whether the null hypothesis of no relationship can be rejected.
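
A brief sketch of the proportion formula, applied to the same admissions figures, follows. It assumes that 6 of the 36 applicants were accepted by both universities, that 14 were accepted by one and 22 by the other (the marginal totals of the table); which university has which total does not affect the result. Note that this form is positive when the two "yes" outcomes go together, so for these data the sign comes out negative, while the size of the coefficient, and therefore χ² = Nφ², is the same as before.

```python
import math

def phi_from_proportions(p_ik, p_i, p_k):
    """Phi from the proportion passing both (p_ik) and each one (p_i, p_k)."""
    q_i = 1 - p_i
    q_k = 1 - p_k
    return (p_ik - p_i * p_k) / math.sqrt(p_i * q_i * p_k * q_k)

n_total = 36
p_ik = 6 / n_total     # accepted by both universities
p_i = 14 / n_total     # accepted by the first university (assumed row total)
p_k = 22 / n_total     # accepted by the second university (assumed column total)

phi = phi_from_proportions(p_ik, p_i, p_k)
chi_square = n_total * phi ** 2
print(round(phi, 3), round(chi_square, 2))   # -0.299 3.21
```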


As you can see, a phi correlation is very similar to the Chi-square test. In
Chi-square, we first test to see whether a relationship exists; that is, can we reject the
null hypothesis? Then, as a follow-up to see just how strong the relationship is
in the data, we can use phi. Here, we begin by asking whether the two variables
are related. The value of φ tells us how related the variables are. We can then
ask whether this correlation will allow us to reject the null hypothesis by inserting
the value into a χ² formula.


The phi correlation has been subject to some criticism when applied to language
tests (where correlations are carried out between test items each of which is
correct or incorrect). Many test experts have turned to tetrachoric correlation as
a correlation less sensitive to possible distortion caused by the relative difficulty
of each test item. To learn more about this correlation and the use of interitem
correlation in factor analysis, please consult Davidson (1988).


ooooooooooooooooooooooooooooooooooooo 

Practice 15.9 

► 1. The number of American children who were early entrants (K-2nd grade)
into a Spanish immersion program was 82; the number of late entrants (3rd-5th
grade) was 112. The number of Ss who elected to continue Spanish in high school
from the early group was 68; from the late group, 71. Is there a significant φ
coefficient for the correlation?




OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

We have discussed five kinds of correlations in this chapter: Pearson, point
biserial, Spearman's rho, Kendall's W, and phi (φ). There are many other kinds
of correlations, some of which will allow you to work with curvilinear data (that
obtained for scores on such variables as anxiety). For these, we suggest you
consult your local statistician or textbook references such as Hays (1973) or
Daniel (1978). Perhaps the best reference to consult for all correlations is
Guilford and Fruchter (1978).


Activities 


1. Check your research journal to find examples of studies where you would wish
to run a correlation. What are the variables? Do you expect to find a positive
or a negative correlation? Which type of correlation will you use? What strength
of correlation do you hope to obtain? Why?

2. R. Guarino & K. Perkins (1986. Awareness of form class as a factor in ESL
reading comprehension. Language Learning, 36, 1, 77-82.) gave 35 ESL learners
a standard reading comprehension test and a test meant to assess knowledge of
form class, or a word's morphemes/structural units. Results indicated a
significant relationship between the two tests. Although the authors concluded that
awareness of form class relates to success in reading, they cautioned that reading
also involves metalinguistic skills. If this were your study, approximately what
strength would you want the correlation to obtain in order to claim the two
variables are related? Justify your choice.

3. Look back at the correlation table for the ESL placement subtests (page 443).
Given that the correlations are so very high among the subtests, do you think that
the point biserial correlation figures for test items within each subtest will always
show stronger correlation with the subtest they represent than with other
subtests? Why (not)?

4. T. Pica, R. Young & C. Doughty (1987. The impact of interaction on
comprehension. TESOL Quarterly, 21, 4, 737-758.) used point biserial correlation in
an interesting way. They compared NNSs' comprehension of directives under
two conditions: premodified simplified input and negotiated input. Before
answering this question, to be sure the premodified simplified commands were good
items to use, they ran a point biserial correlation on each directive. Of the 15
commands, 13 had an r_pbi of .30 or above. In your own words, explain how the
researchers (or their computer program) performed the point biserial correlation.

5. E. Geva (1986. Reading comprehension in a second language: the role of
conjunctions. TESL Canada Journal, Special Issue 1, 85-110.) designed several
tasks to investigate the importance of intrasentential, intersentential, and
discourse level comprehension of conjunctions on reading comprehension. Sixty
university ESL students performed a "fill in the blank" task which tested
comprehension at the intrasentential level (e.g., "We could not see the man, although
____" with four options given for the blank), a sentence continuation task (e.g.,
"It was cold outside, although it was sunny" with two choices for continuation,
"So it was a good day for skiing" or "So Johnny's mother made him wear a
sweater"). Two methods were used to test intersentential comprehension: a cloze
passage to test comprehension at the discourse level, and comprehension tests
based on academic text passages. Oral proficiency ratings were also available for
each student.

Consider the following correlation matrix to decide whether oral proficiency is
related to the tasks which test conjunction comprehension. How would you
interpret the correlations between oral proficiency and the comprehension of
conjunctions at intrasentential, intersentential, and discourse levels? How would you
interpret the correlation of cloze scores (the discourse level measure) and ATC?

Intercorrelations Between Oral Proficiency Ratings, Predictor
Variables and Total Score on the Dependent Measure

               Oral                            ATC
               Prof.    FBT     SCT    Cloze   (total)

Oral Prof.     1.00
FBT             .37    1.00
SCT             .24     .43    1.00
Cloze           .69     .26     .30    1.00
ATC (total)     .43     .27     .40     .49    1.00

FBT = fill in the blank task, SCT = sentence continuation task, ATC =
academic text comprehension

6. R. Altman (1985. Subtle distinctions: should versus had better. Research Note,
Studies in Second Language Acquisition, 8, 1, 80-89.) developed and administered
several measures of modal usage to 12 nonnative and 12 native speakers of
English. The results showed that understanding of expressions of modals of
obligation is different for these native and nonnative groups. Examine the following
table which reports the value of W (Kendall Concordance analysis) within each
group on one of these measures, that of modal strength.

Average Ranking of Seven Modal Test Items
(from strongest to weakest)

     Native speakers*     Nonnative speakers**

1    must                 must
2    have to              have to
3    'd better            should
4    should               BE supposed to
5    BE supposed to       could
6    can                  can
7    could                'd better

*W = .90, p < .001     **W = .615, p < .005

Interpret the meaning of the W statistic for each group. Then look at the order
shown for each group. For which modals are the ranks most similar, and for
which are they dissimilar? If you are interested in modals of obligation, you will
want to consult the other measures Altman devised (not analyzed using this
procedure) which were administered to larger groups of Ss.

7. V. Nell (1988. The psychology of reading for pleasure: needs and
gratifications. Reading Research Quarterly, 23, 1, 6-50.) explored (among many other
things) readers' ranking of books for preference and literary merit. Three groups
of Ss (129 college students, 44 librarians, and 14 English lecturers, or
"professional critics") read short extracts from 30 books which spanned the continuum
from highest literary quality to total trash. Ss ranked the extracts for preference
(what they would read for pleasure) and merit (what is good literature).

Kendall's Concordance for preference rankings and for merit rankings showed
that within the three groups of Ss, rankings were significantly similar. Spearman
rho coefficients showed that preference rankings were the same for students and
librarians; for merit rankings, all three groups ranked the extracts in significantly
similar sequences. The author concludes that all groups share a common set of
literary value judgments.

In addition, negative Spearman rho coefficients indicated that merit and
preference ratings are inversely related. For all groups, the higher merit an extract was
given, the less desirable it was for pleasure reading.

If this were your study, what coefficient values (i.e., what range of values) would
you want to obtain in order to make these claims for the findings? Justify your
decision.


8. K. Bardovi-Harlig (1987. Markedness and salience in second language
acquisition. Language Learning, 37, 3, 385-407.) examined the acquisition of
preposition "pied piping" by 95 ESL learners representing 8 proficiency levels. An
elicitation task was used to look at preposition use for dative Wh-questions and
relative clauses. Results indicated that Ss show an order of No Preposition >
Preposition Stranding > Pied Piping. Examples given for No Preposition are
"Who did Mary give a book?" and "The man Mary baked a cake was Joe"; for
Preposition Stranding, "Who did Phillip throw the football to?" "The guard was
watching the player who Phillip threw the football to"; and for Pied Piping: "For
whom did George design the house?" "The teacher helped the student for whom
the lesson was difficult."

A Kendall tau correlation for two structures (Wh-questions and relative clauses)
showed the order of No Prep > Prep Stranding > Pied Piping to be correct: tau
= -.25, p < .01. The Kendall tau for another order, which would be predicted
by markedness theory (namely No Prep > Pied Piping [unmarked form] >
Prep Stranding [marked form]), had a tau = .05, n.s. The author explains this
apparent counterexample to markedness theory by suggesting that salience also
plays an important role.

If you are interested in markedness theory, read the article. Consider other
explanations you might be able to give for the findings. Also consider what
advantages the Kendall tau procedure has over either the Spearman Rank-order
correlation or a Pearson correlation. If you were to replicate this study, how
would you decide which correlational or other parametric and nonparametric
procedure to use for the data?



Chapter 16
Regression


• Linear regression
• Multiple regression

Restrictions related to multiple regression
Interpreting multiple regression
Relation between ANCOVA and regression

When we calculate the mean (X̄), we think of it as the central point of the
distribution. If we gave a test, calculated the X̄, and then found that one of our Ss
missed the test, our very best guess as to that S's score (assuming we know
nothing about the S's ability) would be the X̄. However, if we do know
something about the student, we may be able to make a better guess. If the S has
shown outstanding abilities in the classroom, we would likely guess a score higher
than the X̄. If the S was not doing well in class, we'd guess some score lower
than the X̄.

When we have no other information, the best guess we can make about anyone's
performance on a test is the X̄. If we knew, on the basis of a strength of
association η² (eta squared) that some of the variance in performance on this test was
due to sex or on the basis of ω² (omega squared) that some was due to L1
membership, then by knowing the student's L1 and sex, we could improve our guess.
If we knew, on the basis of a Pearson r² that there was a strong overlap of scores
on this test and scores on the SAT, then by knowing the student's score on the
SAT, we could improve our guess. If we knew that SAT scores, sex, and L1
membership (and perhaps length of residence or age of arrival or a host of other
variables) were possibly related to performance on this test measure, then by
having all this information on the student, we could be quite accurate in
predicting the student's score. We would not need to guess the X̄; we could be much
more accurate in our predictions.

Regression, then, is a way of predicting performance on the dependent variable
via one or more independent variables. In simple regression, we predict scores
on one variable on the basis of scores on a second. In multiple regression, we
expand the possible sources of prediction and test to see which of many variables
and which combination of variables allow us to make the best prediction.

You can see that simple regression is useful when we need to predict scores on a
test on the basis of another test. For example, we might have TOEFL scores on
students and use those scores to predict the Ss' scores on our ESL placement test.
If accurate predictions are possible, then we can spare students the joy of taking
another test and ourselves the correcting of more exams.

Multiple regression is more frequently used when we want to know how much
"weight" to give to a variety of possible independent variables that relate to
performance on the dependent variable. For example, if we have an achievement
test for Spanish and previous research has shown that success on this test is
related to a number of things (e.g., one study shows that age at which Spanish
instruction was begun relates to achievement; another study shows that amount of
interaction with native speakers is related to achievement; another that reading
scores relate to achievement; and so forth), we can determine which of all these
variables best predicts achievement and/or which combination of these variables
most accurately predicts achievement, and which do not add much information
to the prediction.


Linear Regression 

Regression and correlation are related procedures. The correlation coefficient,
which we discussed in the last chapter, is central to simple linear regression.
While we cannot make causal claims on the basis of correlation, we can use
correlation to predict one variable from another.

Let's assume we have the following scores for Ss on two tests, a reading test score
and a cloze test score.


S     Reading    Cloze

1        10         5
2        20        10
3        30        15
4        40        20
5        50        25


The data form a perfect relationship (r = 1, r² = 1). For each increase of 10
points in reading, there is a corresponding increase of 5 points on the cloze test.
With such a perfect relationship, you can predict with complete accuracy where
any S will score on one test on the basis of the score on the other.
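
As a quick check on the claim that these scores are perfectly related, the following Python sketch computes the Pearson r for the five Ss. The helper function is our own; it uses the ordinary deviation-score form of the Pearson formula.

```python
import math

reading = [10, 20, 30, 40, 50]
cloze = [5, 10, 15, 20, 25]

def pearson_r(x, y):
    """Pearson correlation from deviations around each mean."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

r = pearson_r(reading, cloze)
print(r, r ** 2)   # 1.0 1.0

# With a perfect relationship, every cloze score is exactly half the reading score.
print(all(c == rd / 2 for rd, c in zip(reading, cloze)))   # True
```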

If the correlation were ever perfect, all we would need to do to predict one score 
on a variable from the score on the second is to locate the person's score on one 
axis of the scatterplot and then look up to where the line drawn from that score 
hit the correlation line, and then check across.

[Scatterplot: perfect correlation (r = 1.0); horizontal axis X]


The chance of ever finding a perfect correlation is, of course, next to never. In
addition, few people want to go to the trouble of drawing a scatterplot and then
plotting out all the scores to see where the scores intersect in order to find the
score of Y given the score on X.

Since correlations are not perfect, the scores will not all fall right on our
imaginary straight line. For example, here is a scatterplot that shows the (fictitious)
correlation of scores on the MLAT (Modern Language Aptitude Test) and an
achievement test on learning a computer language. The research question asked
was whether there is a relationship between language learning aptitude and the
learning of an artificial machine language.


[Scatterplot: COMP TEST scores (vertical axis, 40 to 80) plotted against MLAT
scores (horizontal axis, 65 to 100)]


If we drew in the best-fitting straight line for our correlation, it might look
something like that drawn on the scatterplot. Look at the circled dot on the
scatterplot. This is the intersection of one S's score on the MLAT (75 points) and
the score on the computer-language achievement test. If we drew a straight line
up to the best-fitting straight line, we would expect the score on the
computer-language achievement test to be around 61. However, it is not. The actual score
was 56. We make a 5-point error in our estimate for this particular S.

The closer the correlation is to ±1, the smaller the error will be in predicting
performance on one variable from that of the second. The smaller the correlation
(the closer it is to 0), the greater the error in predicting performance on a second
variable from that on the first. Think back to the various scatterplots we showed
in the previous chapter. When the value of r was close to ±1, the points
clustered close together along the best-fitting straight line. Thus, there would be little
error in predicting scores in such a distribution. In analyzing your own data, you
might first ask the computer to give you a scatterplot in order to see the
correlation and also to allow you to identify cases (if there are any) where Ss do not
seem to fit the correlation "picture." The printout will also allow you to see the
amount of error (sometimes that means more to us than seeing a numerical value
for error); the error, however, will be labeled "residual" in most computer
programs.

Four pieces of information are needed to allow us to predict scores using
regression. These are (1) the mean for scores on one variable (X̄), (2) the mean for
scores on the second variable (Ȳ), (3) the S's score on X, and (4) the slope of the
best-fitting straight line of the joint distribution. With this information, we can
predict the S's score on Y from X on a mathematical basis. By "regressing" Y on
X, predicting Y from X will be possible.

We already know how to compute X̄, so that part is no problem. However, we
have not yet discussed how to determine slope (which is usually abbreviated as
b). If we try to connect all the dots in a scatterplot, we obviously will not have a
straight line (more likely a picture of George Washington?). We need to find the
best-fitting straight line. Look at the following scatterplot. You can see that lines
drawn to the straight line show the amount of error. Suppose that we square
each of these errors and then find the mean of the sum of these squared errors.
This best-fitting straight line is called the regression line and is technically defined
as the line that results in the smallest mean of the sum of the squared errors.
We can think of the regression line as being that which comes closest to all the
dots but, more precisely, it is the one that results in a mean of the squared errors
that is less than any other line we might produce.
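
The definition can be illustrated with a small experiment. In the Python sketch below, the scores are hypothetical (invented only for this illustration, not taken from the scatterplot), and the function names are our own. The script finds the least-squares line, computes its mean squared error, and then confirms that several nearby lines, with slightly different slopes or intercepts, all give a larger mean squared error.

```python
# Hypothetical scores, invented only to illustrate the definition.
mlat = [65, 70, 75, 80, 85, 90, 95, 100]
comp = [48, 55, 56, 63, 62, 70, 71, 78]

def mean(values):
    return sum(values) / len(values)

def mean_squared_error(slope, intercept, x, y):
    """Mean of the squared vertical distances from the points to the line."""
    errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return mean([e ** 2 for e in errors])

def least_squares_line(x, y):
    """Slope and intercept of the line that minimizes the mean squared error."""
    mean_x, mean_y = mean(x), mean(y)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return slope, intercept

b, a = least_squares_line(mlat, comp)
best = mean_squared_error(b, a, mlat, comp)

# Try eight other lines: slopes and intercepts nudged away from the best values.
others = [mean_squared_error(b + db, a + da, mlat, comp)
          for db in (-0.1, 0.0, 0.1)
          for da in (-2.0, 0.0, 2.0)
          if (db, da) != (0.0, 0.0)]

print(round(b, 3), round(a, 2), round(best, 2))
print(all(best < other for other in others))   # True
```

The individual errors computed inside `mean_squared_error` are what most computer programs label "residuals."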
Before we give you the raw score formula for b (slope of the regression line), let's
make a link between the concept of slope and that of correlation. Imagine that
we had the scores on the computer test and on the MLAT. Because the scores
come from different measures, we convert each of these scores to a z score. This
makes them comparable. Then we could plot the intersection of each S's z score
on the MLAT and on the computer test. These would form the dots on the
scatterplot. The best-fitting straight line would have a slope to it. As the z scores
on the MLAT increase they form a "run," the horizontal line of a triangle. At the
same time, the z scores on the computer test increase to form a "rise," the vertical
line of the triangle. The slope (b) of the regression line is shown as we connect
these two lines to form the third side of the triangle.



[Figure: triangle with a horizontal "Run" side from A to B, a vertical "Rise"
side, and a sloping third side]


In the above diagram, an increase of six units on the run (MLAT) would equal
two units of increase on the rise (computer test). The slope is the rise divided by
the run. The result is a fraction. That fraction is the correlation coefficient. The
correlation coefficient is the same as the slope of the best-fitting line in a z-score
scatterplot. If, in our example, we knew that the standard deviation on the
MLAT is 10, a z score of 1 is 10 points above the mean. A z score of 2 is 20
points above the mean. Imagine, too, that the standard deviation on the computer
test was 8 points. A +2 z score would be 16 points above the mean. A
-1 z score would be 8 points below the mean. For a z-score scatterplot, we know
that the slope of the best-fitting straight line is the correlation coefficient.
In the little triangle we showed that the slope of the regression line was
2 ÷ 6, and so the correlation coefficient for the two is .33. We now know that
each z-score unit on the horizontal axis (X) equals 10 points and each z-score unit
on the vertical axis (Y) equals 8 points. Therefore, the raw-score slope will be
about .26. That is, 2 z units when the s is 8 = 16, and 6 z units for an s of 10 = 60,
and 16 ÷ 60 = .27 (or .26 when we work from r rounded to .33). All we
have to do to obtain the slope is to multiply the correlation coefficient by the
standard deviation of Y over the standard deviation of X.

$$b = r\,\frac{s_Y}{s_X}$$

$$b = .33 \times \frac{8}{10}$$

$$b = .26$$
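
To see the link between slope and correlation in action, here is a minimal Python sketch. The scores are hypothetical (they do not come from any study discussed here) and the helper names are our own. It assumes the sample standard deviation (N - 1 in the denominator) and checks two things: that the least-squares slope of the z scores equals r, and that the raw-score slope equals r multiplied by the standard deviation of Y over the standard deviation of X.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Standard deviation with N - 1 in the denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def z_scores(values):
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def least_squares_slope(x, y):
    """Slope of the line that minimizes the squared vertical errors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical scores for eight Ss (illustration only).
mlat = [40, 55, 60, 45, 70, 65, 50, 75]
computer_test = [22, 30, 28, 31, 35, 27, 24, 38]

r = pearson_r(mlat, computer_test)
slope_z = least_squares_slope(z_scores(mlat), z_scores(computer_test))
slope_raw = least_squares_slope(mlat, computer_test)
slope_from_r = r * sample_sd(computer_test) / sample_sd(mlat)

print(f"r = {r:.3f}, z-score slope = {slope_z:.3f}")
print(f"raw slope = {slope_raw:.3f}, r * sY / sX = {slope_from_r:.3f}")
```

Each printed pair should agree, whatever scores you substitute for the hypothetical ones.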

It is very easy to find the slope if you have the correlation coefficient and the
standard deviation for X and Y. However, sometimes we do not already have the
correlation coefficient, and do not have the time to compute and plot out the z
scores for each S on the two measures. Still, you can as easily compute b using
raw score data. The formula for slope follows.

$$b = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2}$$

Whenever you need to do regression but do not yet have the correlation, you can
use this "long" formula and work from the raw data.

To be sure that you can compute the slope given the necessary information, let's 
work through another example. Imagine that you have given the TOEFL exam 
to a group of students hoping to come to the United States to complete their 
graduate degrees. The government sending these students is, however, concerned 
that the students should not only show abilities on the TOEFL but should also 
do well on the Test of Spoken English (TSE). So the students also are given this 
test. Afterwards, you wonder how related the two tests are and whether you 
could accurately predict a S's score on one test (the TOEFL) given the score on 
the other (TSE). 

The mean on the TOEFL for your group was 540 and the standard deviation was
40. The mean on the TSE was 30 and the standard deviation was 4. The correlation
computation showed that r = .80. Because we are predicting TOEFL scores from
TSE scores, the TOEFL is Y and the TSE is X. We can, then, determine the slope.

$$b = r\,\frac{s_Y}{s_X}$$

$$b = .80 \times \frac{40}{4} = 8.0$$

The scatterplot might look like this. Notice that we have marked the means on
each test as X̄ and Ȳ. The point where these two means intersect has been
labeled with a zero (0). Notice that the scatterplot is set out in z score intervals and
the mean for each, therefore, is a z score of 0. The regression line passes through this
intersection.



Assume that after the last administration of the TOEFL in this country, a new
candidate for study abroad appeared. It would be possible to give her an interview
for the TSE. Once you obtained her score of 36 on the TSE, you could
convert it to a z score and place it in the scatterplot diagram, look across and
discover what the corresponding z score might be on the TOEFL.

The problem is that nobody (that we know) ever actually converts all the scores
to z scores and draws a scatterplot that would allow one to do this. Instead, we
need a mathematical way of computing this result.

We know that the slope is 8 (b = 8). So for every 4-point shift in the X line, there
will be a 32-point shift in the Y line (4 × 8 = 32). For a 2-point shift in X, there
will be a 16-point shift in Y (2 × 8 = 16). Looking back at the diagram, we see that
if we begin at the intersection 0 and move to the right on the X axis (above the
TSE mean), there will be a consequent increase on the Y axis (above the TOEFL
mean). The amount of increase depends on the slope. We can calculate this increase
without reference to the scatterplot by using the slope. For example, with
a TSE score of 36, the shift from the mean of 30 is 6 points. Multiplying that
by the slope, we get 8 × 6 = 48. So our prediction of the TOEFL score is the
mean of Y, 540, plus 48, or 588.

Let's convert all these words into a formula for predicting an individual S's score 
on the Y variable given his or her performance on the X variable. The formula 
follows. 

$$\hat{Y} = \bar{Y} + b(X - \bar{X})$$

$$\hat{Y} = 540 + 8(36 - 30)$$

$$\hat{Y} = 588$$

Here Ŷ is the predicted Y. The formula for this prediction is sometimes called the
regression equation or the prediction formula. Another regression equation that is
more useful with computer output is:

$$\hat{Y} = a + bX$$

In this formula, a is the Y-intercept, information which is routinely generated by
the computer, b is the slope, and X is the score we use to predict Y. Obviously,
this formula is much easier if you are using a computer package that gives you
this information.
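
The two regression equations give the same prediction, and a short Python sketch can confirm this for the TSE and TOEFL figures used in the text. The function names are our own, and the sketch assumes the standard relationship a = Ȳ - bX̄ for the intercept of a line that passes through both means (the text does not state this relationship explicitly).

```python
def slope_from_r(r, sd_y, sd_x):
    """b = r * (s_Y / s_X)."""
    return r * sd_y / sd_x

def intercept(mean_x, mean_y, b):
    """Y-intercept of a line through the two means: a = mean_y - b * mean_x."""
    return mean_y - b * mean_x

def predict_from_means(x, mean_x, mean_y, b):
    """Predicted Y = mean_y + b * (X - mean_x)."""
    return mean_y + b * (x - mean_x)

def predict_from_intercept(x, a, b):
    """Predicted Y = a + b * X."""
    return a + b * x

# Values from the TSE (X) and TOEFL (Y) example in the text.
TSE_MEAN, TSE_SD = 30, 4
TOEFL_MEAN, TOEFL_SD = 540, 40
R = 0.80

b = slope_from_r(R, TOEFL_SD, TSE_SD)
a = intercept(TSE_MEAN, TOEFL_MEAN, b)
pred_means = predict_from_means(36, TSE_MEAN, TOEFL_MEAN, b)
pred_intercept = predict_from_intercept(36, a, b)

print(f"b = {b:.1f}, a = {a:.1f}")
print(f"prediction from means = {pred_means:.1f}")
print(f"prediction from intercept = {pred_intercept:.1f}")
```

Both predictions reproduce the 588 worked out above for a TSE score of 36.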


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 16.1 

1. Look at the following scatterplots. In which would the error (residual) be 
greatest and in which smallest? 




► 2. In the TSE and TOEFL example, assume that once you made an exception
and interviewed the first person, another candidate appeared and demanded like
treatment. Without information from the TSE, what is the best estimate for the
TOEFL score of this candidate? ________ This time the S's TSE score was
below the mean for that test, a score of 25. The student's predicted score on the
TOEFL exam would be below the mean. Calculate the predicted value. How different
is this prediction from that obtained without the TSE information?

$$\hat{Y} = \bar{Y} + b(X - \bar{X})$$

Ŷ = ________




Difference in predictions: 


ooooooooooooooooooooooooooooooooooooo 


We will do one more example of simple regression, working directly with raw data
this time. Imagine that you have scores on an error detection test for a group of
Ss. This is a test where students are given sentences containing grammar errors.
The S must correctly identify the location of the error. Say there are 15 such
items so the S has a possible score of 0 to 15 on the test. In addition, you have
scores derived from scale ratings given by teachers to their written compositions.
After the tests and compositions have been scored, you hope to be able to use the
error detection test (since it is so easy to score) to predict the next S's score on
composition.


        X      Y     X²     Y²     XY
        1      5      1     25      5
        2      6      4     36     12
        3      8      9     64     24
        4      9     16     81     36
        5      8     25     64     40
        6     10     36    100     60
        7     13     49    169     91
        8     12     64    144     96
        9     13     81    169    117
       10     11    100    121    110
       11     13    121    169    143
       12     14    144    196    168
       13     14    169    196    182
       14     13    196    169    182
       15     13    225    169    195
  Σ   120    162   1240   1872   1461

X̄ = 8.00; Ȳ = 10.8; s_x = 4.47; s_y = 2.96



Let's do a correlation and then use the correlation and information on the slope 
to predict composition scores given error detection scores. 

The first step is to calculate the correlation between the two measures: 


$$r_{xy} = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$$

$$r_{xy} = \frac{(15)(1461) - (120)(162)}{\sqrt{[15(1240) - (120)^2][15(1872) - (162)^2]}}$$

$$r_{xy} = .89$$


The second step is to determine the slope. 

$$b = r_{xy}\,\frac{s_y}{s_x}$$

$$b = .89 \times \frac{2.96}{4.47}$$

$$b = .59$$

Since we had such a high value for r, we should feel confident in predicting the
composition scores from the written error detection test. The overlap between the
two measures is .79 (that is, .89 squared). The next student who missed the
composition section of the test scored 10 on the error detection test. Let's
calculate her predicted composition score.

$$\hat{Y} = \bar{Y} + b(X - \bar{X})$$

$$\hat{Y} = 10.8 + .59(10 - 8)$$

$$\hat{Y} = 11.98$$

Just to be sure all our calculations are correct, let's use the raw score formula for 
slope, and see if the results match. 


$$b = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2}$$

$$b = \frac{(15)(1461) - (120)(162)}{(15)(1240) - (120)^2}$$

$$b = \frac{2475}{4200} = .59$$

Hooray, they match!
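
If you would rather let the computer do the arithmetic, the raw-score computations for the error detection example can be sketched in a few lines of Python. The data are the 15 pairs of scores from the table above; the variable names are our own, and the sketch simply applies the raw-score formulas for r and b rather than any particular statistics package.

```python
import math

# Error detection scores (X) and composition ratings (Y) from the table.
x = list(range(1, 16))
y = [5, 6, 8, 9, 8, 10, 13, 12, 13, 11, 13, 14, 14, 13, 13]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# The numerator is shared by the raw-score formulas for r and for b.
numerator = n * sum_xy - sum_x * sum_y
x_term = n * sum_x2 - sum_x ** 2
y_term = n * sum_y2 - sum_y ** 2

r = numerator / math.sqrt(x_term * y_term)
b = numerator / x_term

mean_x, mean_y = sum_x / n, sum_y / n
predicted = mean_y + b * (10 - mean_x)  # S who scored 10 on error detection

print(f"sums: {sum_x}, {sum_y}, {sum_x2}, {sum_y2}, {sum_xy}")
print(f"r = {r:.2f}, b = {b:.2f}, predicted Y = {predicted:.2f}")
```

Because the sketch carries b at full precision instead of rounding it to .59 first, the prediction may differ from the hand calculation in the third decimal place.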


In our discussion so far, we have used correlation to help us improve our guess
when we want to predict a person's score on the basis of his or her score on another
test. When we do this, we know that the prediction will not be exact. There
is "error" in the prediction. This is because the correlation between the two tests
is not perfect. If we continuously make errors in our predictions (the residuals
are large), the question we need to ask is how much we gain by using simple
regression for this purpose. We know that our best guess, if we have no other
information, is the mean. The amount we can improve on that guess is determined
by the strength of the correlation between the two measures. If the correlation
is relatively weak, we gain a little but not much in prediction. If the correlation
is strong, then our guessing improves a great deal. If you visualize the scatterplot
of a weak and a strong correlation, you can see why this is the case. In a strong
correlation, the scores all cluster around the regression line. When the scores are
close to the line, there is only a small amount of error involved in the prediction.
When the scores are spread out away from the line, then the amount of error increases.
The weaker the correlation, the more error we make in our predictions.

Let's think about this issue in another way. Imagine that we have found a correlation
of .50 between the type/token ratios in Ss' writing (a rough measure of
vocabulary flexibility) and their reading vocabulary scores. The mean for the
reading vocabulary test was 35 and the s was 8. If you plot the type/token ratios
with vocabulary scores on a scatterplot, you can see the difference between using
the mean as the best estimate vs. using the regression formula.



Remember that there will be some overlap in the variance of the two scores.
When we square the value of r, we find the degree of shared variance. Of the
original 100% of the variance, with an r = .50, we have accurately accounted for
25% of the variance using this straight line as the basis for prediction. We have
reduced the error variance now to 75%. So we have substantially increased the
accuracy of our prediction by using linear regression rather than the mean, but there
is still much error in the prediction.

When we square r, we have the proportion of variance in one variable accounted
for by the other. When we remove that proportion of variance from 1.00 (the
total variance), we remove that overlap and what is left is the error associated
with predicting from the regression line. We can change this into a formula which
will give us the standard error of estimate, which is usually symbolized as s_yx or
SEE in computer printouts. Visualize the dispersion of the data around the regression
line in analogy to the spread of scores in a normal distribution. The straight
line represents all the predicted Y values. The "data" around the line are the actual
Y values. The standard deviation is a figure that tells us about variability
of individual scores away from the mean. In regression, SEE does the same job
regarding the dispersion of scores away from the straight line. In the case of
standard deviation, the larger the standard deviation, the greater the dispersion
of scores away from the mean; the larger the SEE, the greater the dispersion from
the straight line. In a normal distribution, if the standard deviation is very small,
we would not make much error in predicting new scores using the mean. In regression,
if all the data are tightly clustered on the line, little error is made in
predicting scores on one variable from scores on a second variable.

SEE is an important figure for it tells us how much error is likely to occur in
prediction. That is, the SEE gives us a more exact idea of how far off our prediction
may be. Once you have predicted a score, you should look at the SEE to
see if it is large. The larger the SEE, the greater the amount of error in prediction.
If it is very large, then you may do just as well using the mean. (There is
no set "too high" value. You might judge this value in the same way that you do
standard deviation. There is no set "too high" value for standard deviation, but
we know, in both cases, that the value gives us a way of judging how far off we
might be. For SEE, it tells us how great the error may be in prediction. For
standard deviation, it gives us a rough estimate of how far off we might be
if we thought that all scores were similar to the mean.)

To compute SEE, we need to know the error variance. The error variance is the
sum of the squared differences between actual and predicted scores, divided by N - 2.

$$\text{Error Variance} = \frac{\sum (Y - \hat{Y})^2}{N - 2}$$

The square root of this variance is referred to as the SEE:

$$SEE = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}}$$

Another formula for SEE that uses values often more easy to find is the following:

$$SEE = s_y\sqrt{1 - r_{xy}^2}$$

For the error detection problem, s_y = 2.96 and r_xy = .89. So, we can easily compute
the SEE for those data.

$$SEE = 2.96\sqrt{1 - .89^2}$$

$$SEE = 2.96 \times .458$$

$$SEE = 1.35$$

If our data are normally distributed, this means that 68% of actual Y scores
would fall within ± s_yx, or ± 1.35, of the predicted Y score.

One way to interpret the SEE is to make a confidence interval around the regression
line. If we had a set of predicted Y scores and actual Y scores and calculated
the difference between each predicted vs. actual, we would be left with
a set of errors in the estimate which are distributed around the regression line.
We can then use our knowledge of z scores and the normal distribution to determine
how much confidence we can place in our predictions of Y, based on these
"errors." You remember that 68% of a normal distribution is within ± 1 s, 95% within
± 1.96 s, and ± 2.58 s covers 99% of the distribution. Since SEE is the "standard
deviation" for regression, we can use these figures to determine our confidence.

For a predicted Y of 11.98 on the composition measure (predicted from an error
detection score of 10), we can be 95% sure the actual score is between 9.33
(11.98 - 2.65 = 9.33) and 14.63 (11.98 + 2.65 = 14.63), where 2.65 = 1.96 × 1.35.
Similarly, we'd be 99% sure the actual score would be between 8.50 (11.98 - 3.48)
and 15.46 (11.98 + 3.48), where 3.48 = 2.58 × 1.35. This shows us that our
predicted scores are somewhat close to actual scores.
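
The SEE and the intervals built on it can be sketched in Python as follows. The function names are our own. The sketch computes the SEE two ways: with the shortcut formula from r and s_y, and directly from the residuals with N - 2 in the denominator. In a small sample such as this one the two need not agree exactly, because the shortcut uses an s_y computed with N - 1. The intervals treat the SEE as the standard deviation of normally distributed errors, as the text does.

```python
import math

def see_from_r(sd_y, r):
    """Shortcut: SEE = s_y * sqrt(1 - r^2)."""
    return sd_y * math.sqrt(1 - r ** 2)

def see_from_residuals(x, y):
    """SEE = sqrt(sum of squared (actual - predicted) / (N - 2))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    residual_ss = sum(
        (yi - (mean_y + b * (xi - mean_x))) ** 2 for xi, yi in zip(x, y)
    )
    return math.sqrt(residual_ss / (n - 2))

def interval(predicted, see, z):
    """Band of z standard errors on either side of the predicted score."""
    half_width = z * see
    return predicted - half_width, predicted + half_width

# Error detection (X) and composition (Y) data from the worked example.
x = list(range(1, 16))
y = [5, 6, 8, 9, 8, 10, 13, 12, 13, 11, 13, 14, 14, 13, 13]

shortcut = see_from_r(2.96, 0.89)
from_residuals = see_from_residuals(x, y)
low95, high95 = interval(11.98, 1.35, 1.96)
low99, high99 = interval(11.98, 1.35, 2.58)

print(f"SEE (shortcut) = {shortcut:.2f}")
print(f"SEE (residuals, N - 2) = {from_residuals:.2f}")
print(f"95%: {low95:.2f} to {high95:.2f}")
print(f"99%: {low99:.2f} to {high99:.2f}")
```

The residual-based figure comes out slightly larger than the shortcut here, which is worth remembering when comparing a hand calculation with a computer printout.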

SEE is important because it gives an indication of how large the error is in estimating
scores for one test on the basis of scores from another. We also know that
if the correlation between the scores for the two tests is low to begin with, the
error will be high. That is, we don't increase the accuracy much above that which
we would have by simply using the mean as the best estimate. When this happens,
there is no point in using one test to predict scores on a second measure.
The two tests are simply too different to allow for successful prediction.


ooooooooooooooooooooooooooooooooooooo 

Practice 16.2 

► 1. For the TOEFL/TSE problem above, calculate the standard error of
estimate (SEE). s_y = 40 and r_xy = .80. How do you interpret this value? ________


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Simple linear regression, then, provides us a way of predicting a S's performance
on one measure based on his or her performance on another measure. It is an
especially useful procedure for student placement. (Some researchers use this
method to supply scores for missing data. When balanced designs are required
and data are missing, we may not want to delete a S for whom we have most
measures but who is missing one score. Rather than use the mean as a neutral
figure for the missing data, some researchers use regression to give a better estimate
of the predicted score. Again, we are not recommending this procedure but
simply saying this is one use of regression. Most computer programs allow for the
entry of missing data as missing and rework the computation around that information.)

Simple regression has rather limited uses in our field. Multiple regression, on the
other hand, is very widely used, and it works on the same principles of prediction.


Multiple Regression 

In simple regression we use the score on one variable to predict a second. Our
ability to do this successfully depends on how closely the two measures are related.
In our field, we usually use multiple regression when we want to discover
how well we can predict scores on a dependent variable from those of two or more
independent variables. With this method we can determine which of all the independent
variables best predicts performance on the dependent variable. Or, it
will allow us to see what combination of variables we need in order to predict
performance on the dependent variable.

For example, we might want to know what combination of subtests best predicts
scores on a total placement test. If we know that some subtests give little additional
information, we might be able to drop some of these, making the test less
time-consuming for everyone. Another example might be attempting to see what
particular qualities or combination of qualities lead to successful language learning
as measured by an overall achievement test. Perhaps we want to look at all
the "cognitive" values associated with successful learning: field independence,
tolerance of ambiguity, intrinsic motivation, and so forth. Multiple regression
would allow us to see which of these, or which combination of these, best predicts
performance on a test of "successful learning."


Restrictions Related to Multiple Regression 

Some words of caution are in order before undertaking analyses using multiple 
regression. First, as a general rule, multiple regression requires that variables be 
interval or truly continuous and that the relationship be linear in nature. That 
is, the same general rules apply as for a Pearson correlation. This is not an absolute
law, however. There are ways in which category variables (such as method
or sex or semester level) can be entered into regression through the use of special 
procedures. (If you wish to do this, please consult with your adviser or statistical 
consultant on how this is to be accomplished.) 

Second, since the procedure builds on correlation, it is doubly important that the 
correlation values be accurate. We know that the reliability of measurement of 
the individual variables may not be equivalent. The reliability of each measure 
must be reported and reliability held constant in measuring correlation (using 
correction for attenuation). 

Third, if the variables entered in the regression formula are highly intercorrelated
(i.e., the correlation among some of the independent variables is in the high .80s
or .90s), there will be a problem in using multiple regression. This is referred to
as multicollinearity. (Another big word for your collection!) Consult with an expert
if this is the case. You may want to omit some of the variables or collapse
them to take care of this problem.

Fourth, remember that the more variables we put into the regression equation,
the larger the N size for the study must be. As a rough rule of thumb, we want
to have 30 Ss for each independent variable. So a study where multiple regression
includes, say, six variables would need an N of approximately 180 Ss.
Some researchers argue that multiple regression should have a minimum of 200
Ss. In any case, this is not a small N-size procedure.

Fifth, if the procedure is being used for inferential purposes rather than as a descriptive
tool, the sample must be drawn at random, the data must be normally
distributed with equal variances, and the regression relationship must be linear
(not curvilinear).
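
The correction for attenuation mentioned in the second caution above can be sketched in Python. The sketch assumes the classical formula, in which the observed correlation is divided by the square root of the product of the two reliabilities; the text names the correction without giving a formula, so treat this as our illustration. The reliabilities and the observed correlation below are hypothetical.

```python
import math

def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Classical correction: r_xy / sqrt(r_xx * r_yy)."""
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must be greater than 0 and at most 1")
    return r_xy / math.sqrt(reliability_x * reliability_y)

# Hypothetical values: observed r = .60, test reliabilities of .80 and .90.
observed_r = 0.60
corrected_r = correct_for_attenuation(observed_r, 0.80, 0.90)

print(f"observed r = {observed_r:.2f}, corrected r = {corrected_r:.2f}")
```

The less reliable the measures, the more the observed correlation understates the relationship, so the corrected value is never smaller than the observed one.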


Interpreting Multiple Regression 

The principles underlying multiple regression are the same as those of simple regression.
Since the mathematical formulation of the procedure is beyond the
scope of this book, we will only discuss the practical aspects of the procedure and
interpretation of tables here. We refer you to Kerlinger and Pedhazur (1973) for
more information on this procedure.

In multiple regression, more than one independent variable is used to improve
prediction of performance on the dependent variable. As an example, let's turn
to Call's research on listening comprehension (1985). The study assesses the
contribution of five types of auditory input to predicting scores on a standard
listening comprehension test (Michigan Test of Aural Comprehension). The first
type tested memory where a discourse context was supplied. Ss heard a brief
story which was interrupted. At the interruption, a "probe word" (the first content
word in the last-heard sentence) was given and Ss were to repeat what they
had heard after that word. A second test removed discourse context as a memory
aid. Ss repeated individual sentences. The third task removed the element of
syntax as a memory aid. Ss were required to repeat strings of content words arranged
in random order. Next, the lexical aid to memory was removed and Ss
repeated strings of random digits (from four to eight digits in length). This was
a test of symbolic memory. Finally, the tone memory section of the Seashore Test
was used. It was predicted that the first three subtests would explain the greatest
amount of variance in listening comprehension scores.

The motivation behind this study was the description of "comprehensible input"
as consisting of material that is familiar to the student (the i of "i + 1")
and a certain amount of unfamiliar material whose meaning can be induced from
the context (the "+ 1" of "i + 1"). According to Krashen, Terrell, Ehrman, and
Herzog (1984), when students are presented with such input, they make use of
"key vocabulary items (nouns, verbs, adjectives, and sometimes adverbs)" that are
familiar to them in order to understand the global meaning carried by the input.
Context (linguistic and nonlinguistic) clarifies the meanings of unfamiliar words.
Comprehensible input is therefore defined in terms of context and content word
vocabulary; the contribution of syntax is not given much weight. The subtests
of Call's study were meant to show which kinds of memory best predict success
in listening comprehension. Since the tasks vary context, syntax, and other
memory components, the results could serve as indirect evidence in support or
nonsupport of the Krashen et al. hypothesis.

The correlations show that each of the tests is related to the listening comprehension
test.


Correlations of Tests with Listening Comprehension

             Discourse   Syntax   Words   Digits   Tone   List.Comp.
Discourse      1.00       .78      .56     .42     .25      .57
Syntax                   1.00      .66     .50     .38      .65
Words                             1.00     .46     .06      .39
Digits                                    1.00     .22      .34
Tone                                              1.00      .42


However, the table also shows that each of the memory tests correlates to some extent
with every other test. This is to be expected since all are components of auditory
memory. However, we do not need to worry about multicollinearity (that is, that any of
these correlations among the independent variables might make it impossible to interpret
the multiple regression) because the correlations are all below .80.
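
A check like this is easy to automate. The Python sketch below stores the correlations among the five memory subtests from the table above and flags any pair at or above .80. The .80 cutoff follows the text; the dictionary layout and variable names are our own choices.

```python
# Correlations among the independent variables, taken from the table above.
correlations = {
    ("Discourse", "Syntax"): 0.78,
    ("Discourse", "Words"): 0.56,
    ("Discourse", "Digits"): 0.42,
    ("Discourse", "Tone"): 0.25,
    ("Syntax", "Words"): 0.66,
    ("Syntax", "Digits"): 0.50,
    ("Syntax", "Tone"): 0.38,
    ("Words", "Digits"): 0.46,
    ("Words", "Tone"): 0.06,
    ("Digits", "Tone"): 0.22,
}

THRESHOLD = 0.80  # cutoff used in the text

# Pairs of predictors that are too highly intercorrelated.
flagged = [(pair, r) for pair, r in correlations.items() if abs(r) >= THRESHOLD]

# The pair that comes closest to the cutoff.
highest_pair = max(correlations, key=lambda pair: abs(correlations[pair]))

if flagged:
    for pair, r in flagged:
        print(f"Possible multicollinearity: {pair[0]} and {pair[1]} (r = {r:.2f})")
else:
    print("No pair of predictors reaches the cutoff.")
print(f"Highest intercorrelation: {highest_pair} = {correlations[highest_pair]:.2f}")
```

Discourse and syntax come closest to the cutoff, which is worth keeping in mind when the order of entry is discussed.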

For simplicity's sake, let's just look at the first three memory subtests: discourse
context, syntax, and words. Imagine that we wanted to run a multiple regression
to find out how well these three predict scores on the listening comprehension
test. Each correlates with the test. But each also correlates with the other two
subtests. In the following diagram you can see the overlap for each with listening
comprehension. The central shaded area shows that much of that overlap is
shared by all three subtests. The lightly shaded areas show the unique
contribution, that is, the unique overlap of each variable with listening comprehension.

Now, you can see that an important decision must be made. If we enter discourse
context into the regression first, it will take the central shaded area of overlap
with it, and the two areas of overlap from syntax and words, leaving syntax and
words with much diminished importance. If we enter syntax next, it takes its
unique contribution and the area of overlap with words. Words is entered third
and now consists only of the unique area which does not overlap with syntax or
discourse. Whichever variable is entered into the regression equation first will
always have a much higher contribution than those entered after it.


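[Figure: overlapping areas of variance shared with listening comprehension. Var 1 = D (discourse); Var 2 = S (syntax).]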



Interpreting the results in such a way that the other variables, those entered after
it, have little importance is obviously unwarranted. Look at the correlation table.
If we enter syntax first, we can account for .65² or 42% of the variance. If we
enter discourse first, we can account for .57² or 32% of the variance in listening
comprehension. If we enter syntax first and thus account for 42% of the variance
in listening comprehension, and then enter discourse next, discourse no
longer will account for 32% of the variance. Instead, the analysis will check to
determine how much "unique" overlap with listening comprehension is left, separated
from all other correlations. Indeed, very little will be left, and so this second
variable adds little information in predicting listening comprehension scores.
Does this mean that the second variable is not important? No, it does not.
Rather, we know that there is a strong overlap between the role of context in
sentence syntax memory and discourse memory. Both are important and together
they allow us to account for much of the variation in listening comprehension.
However, the sentence syntax memory task, since it correlates with the discourse
task, gives us a good predictor of scores on the listening comprehension test.

In running a multiple regression on the computer, you can leave the choice of
which variable to enter first up to the computer. If you do, it will automatically
select the variable with the strongest correlation with the dependent variable first.
Then, it will add in the independent variable that will add most in terms of
unique explained variance next, and so forth. Sometimes this doesn't make sense
and you will want to tell the computer the order in which to enter the variables.
(For example, in predicting placement test scores, you might want to enter certain
variables first not because they have the strongest correlation but because they
are those which are usually most accessible. If these variables serve as good predictors,
then those which are less often available could be omitted in future work.
In such cases, you are not testing theoretical issues but rather practical issues.)
When you (rather than the computer) determine the order of entry, justification
must be given for the choice.

Now let's look at the table from Call's study. For an interpretation and discussion
of the results, please consult the study.


Subtest      Simple r     R²     Change in R²
Syntax         .65       .42        .42
Discourse      .57       .43        .01
Tone           .42       .47        .04
Words          .39       .47        .00
Digits         .34       .47        .00


(The variables were entered into the regression equation in descending order,
from the highest simple correlation to the lowest.)

You should have no difficulty in understanding the column labeled "simple r":
this is the correlation of each subtest with listening comprehension. The R² column
shows the amount of overlap accounted for by sentence syntax (this is .65²
or .42). That is, the two measures share 42% variance. It is a good predictor to
use in estimating listening comprehension scores. The next value in the R² column
is .43. This says that the first and second variable together (i.e., syntax plus
discourse memory tasks) have a 43% overlap with listening comprehension. Notice
that this is only a 1% gain in prediction. When the third variable, tone
memory, is added to the equation, the three variables now account for 47% of the
variance in listening comprehension. Another 4% has been added in terms of
prediction of scores. Notice that adding the last two variables brings about no
change. They add nothing unique to prediction. It's not that they are not important,
but that their contribution has already been made by the previously entered
variables.

Interestingly, Call presents a second table, showing what happens if the variables 
are entered in ascending order from the lowest to the highest simple correlations. 


Subtest      Simple r     R²     Change in R²
Digits         .34       .11        .11
Words          .39       .18        .07
Tone           .42       .32        .14
Discourse      .57       .42        .10
Syntax         .65       .47        .05


The results do not differ in terms of the final value in the R² column. That is, the
five subtests still account for .47 of the variance. With this order of entry, however,
all five subtests are needed to account for information shown by three subtests in the
previous table.

In the first table, you can see that, using three of the independent variables, we
have accounted for 47% of the variation in scores on the dependent variable,
listening comprehension. This leaves 53% of the variance still unaccounted for.
The digit and word variables did not increase the amount of variance accounted
for in listening comprehension. Other variables must be found before we can increase
R².
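
The effect of order of entry can be reproduced from the correlation table alone. The Python sketch below is our own illustration, not Call's analysis. It assumes the standard result that, for standardized variables, R² equals the sum of each standardized weight times that predictor's correlation with the dependent variable, where the weights solve the system formed by the predictor intercorrelations. Because it starts from correlations rounded to two decimals, its values may differ from the published tables in the last digit.

```python
NAMES = ["Discourse", "Syntax", "Words", "Digits", "Tone"]

# Intercorrelations among the subtests (from the correlation table).
R = [
    [1.00, 0.78, 0.56, 0.42, 0.25],
    [0.78, 1.00, 0.66, 0.50, 0.38],
    [0.56, 0.66, 1.00, 0.46, 0.06],
    [0.42, 0.50, 0.46, 1.00, 0.22],
    [0.25, 0.38, 0.06, 0.22, 1.00],
]
# Correlation of each subtest with listening comprehension.
R_WITH_LC = [0.57, 0.65, 0.39, 0.34, 0.42]

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row_i: abs(aug[row_i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row_i in range(col + 1, n):
            factor = aug[row_i][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row_i][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution

def r_squared(indices):
    """R-squared for the predictors at the given positions in NAMES."""
    sub = [[R[i][j] for j in indices] for i in indices]
    rhs = [R_WITH_LC[i] for i in indices]
    weights = solve(sub, rhs)
    return sum(w * r for w, r in zip(weights, rhs))

def cumulative_r_squared(order):
    """R-squared after each variable in `order` has been entered."""
    idx = [NAMES.index(name) for name in order]
    return [r_squared(idx[:k]) for k in range(1, len(idx) + 1)]

descending = ["Syntax", "Discourse", "Tone", "Words", "Digits"]
ascending = list(reversed(descending))
desc_r2 = cumulative_r_squared(descending)
asc_r2 = cumulative_r_squared(ascending)

for label, order, values in (("Descending", descending, desc_r2),
                             ("Ascending", ascending, asc_r2)):
    print(label)
    previous = 0.0
    for name, value in zip(order, values):
        print(f"  {name:<10} R2 = {value:.2f}  change = {value - previous:.2f}")
        previous = value
```

Whatever the order, the last value is the same, since the same five predictors are in the equation by the end; only the credit given to each step changes.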

Interestingly, analysis of variance and multiple regression are probably the two 
most used statistical procedures in applied linguistics research. This is because 
we so often want to investigate the relationship of many different independent 
variables with the dependent variable. Our designs are often amazingly complex. 

As we have noted before, ANOVA and multiple regression can be used for either
descriptive or inferential purposes. When they are used for inferential purposes,
certain assumptions absolutely must be met. We have specified these in this
chapter for multiple regression. However, even for descriptive purposes there are
warnings that must be observed. In reviewing studies using multiple regression
(as with those using Factorial ANOVA), it has been difficult to find research reports
where all the assumptions have been met. N size is often a problem; the
assumptions of Pearson have not been met. One point many of us forget is the
necessity of correction for the effect of different test reliabilities ("correction for
attenuation"). Researchers have a habit of saying "everyone violates assumptions
of statistical tests, and so it's a matter of where you decide to draw the line:
which will you decide are okay to overlook and which will you decide are too
important to ignore." While that indeed may be the case, it is our contention that
when we use a statistical procedure for either descriptive or inferential purposes,
we should state whether we have violated any of the assumptions that underlie the
computation and then justify our use of the procedure. When we simply ignore
assumptions, readers either do not know we have done so, wonder why we have
done so, or wonder what difference it might have made in our results. If the
purpose of using a statistical procedure is to give us (and our readers) confidence
in our findings, then appropriate choice of procedures and appropriate interpretation
of results of the procedure are of importance.


Relation Between ANOVA and Regression

There is a strong resemblance between ANOVA and Regression. In ANOVA, 
we try to account for the variance in a dependent variable on the basis of two 
major components: the variance between groups (which includes the treatment 
effect and error) and the variance within groups (error only). We use sums of 
squares and the formula SST = SSB + SSW to show how the variance can be 
partialed out to these two major components.

In regression analysis, we can conceive of the sum of squares for the predicted 
value of Y as the sum of squares regression (the predicted variation) and the 
leftover variation as sum of squares residual (which is the variance left
unaccounted for). So, in regression, $SST = SS_{reg} + SS_{res}$.

In ANOVA, we formed an observed F ratio using the formula

$$F = \frac{SSB / df_{between}}{SSW / df_{within}} = \frac{MSB}{MSW}$$

In regression, we also can form an observed F statistic using the formula 

$$F = \frac{SS_{reg} / df_{reg}}{SS_{res} / df_{res}}$$

The $df_{reg} = K$ (number of groups) and $df_{res} = N - K - 1$.

Thus, both procedures are dealing with variance in the dependent variable. Both 
try to account for as much variance as possible as an "effect of" (ANOVA) or as 
"accounted for by" (regression) various independent variables. Understanding 
the similarity between these procedures may help you in deciding which (if,
indeed, either) might be the best route to take in your own research projects.
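
The parallel between the two procedures can be checked numerically. What follows is only a minimal sketch, not an analysis taken from any study discussed here: it fits a one-predictor regression by least squares in plain Python, confirms that SST = SSreg + SSres, and forms the observed F ratio. The data (`hours`, `scores`) and the function name `regression_f` are invented for illustration, and K is taken to be the number of predictors (here 1), which is an assumption of the sketch.

```python
def regression_f(x, y, k=1):
    """Least-squares regression of y on a single predictor x.

    Returns (ss_total, ss_reg, ss_res, f_ratio). k is the number of
    predictors, assumed to be 1 in this illustrative sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Slope and intercept of the least-squares line
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]

    # Partition the total variation into predicted and leftover parts
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_reg = sum((pi - mean_y) ** 2 for pi in predicted)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

    # Degrees of freedom: df_reg = K, df_res = N - K - 1
    df_reg = k
    df_res = n - k - 1
    f_ratio = (ss_reg / df_reg) / (ss_res / df_res)
    return ss_total, ss_reg, ss_res, f_ratio


# Hypothetical data, invented for illustration only
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 60, 68, 70]

ss_total, ss_reg, ss_res, f_ratio = regression_f(hours, scores)
print(f"SST = {ss_total:.2f}, SSreg = {ss_reg:.2f}, SSres = {ss_res:.2f}")
print(f"F = {f_ratio:.2f}")
```

The observed F would then be compared with the critical value for (df reg, df res) degrees of freedom, just as in ANOVA.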


Activities 

Each of the following studies employed multiple regression in analyzing data. 
For the article that most interests you, assume you were asked to critique the
article prior to its publication. Note whether the procedure is used for descriptive
or for inferential purposes. Evaluate the appropriateness of the procedure for the
study and critique the interpretation given the results of the procedure. As an 
evaluator, what suggestions would you make to the author(s)? 

1. M. E. Call (1985. Auditory short-term memory, listening comprehension, and
the input hypothesis. TESOL Quarterly, 19, 4, 765-781.) tested 41 ESL students
to assess the role of five short-term memory components (sentences in context,
isolated sentences, random words, random digits, and random tones) in listening
comprehension. Memory for syntax, specifically sentences in isolation or
sentences in context, proved to be the best predictor of successful listening
comprehension.

2. C. Chapelle & C. Roberts (1986. Ambiguity tolerance and field independence
as predictors of proficiency in English as a second language. Language Learning,
36, 1, 27-45.) administered a variety of tests and attitude measures to 48 ESL
learners from two language groups at three times during the semester. Field
independence was measured by the Group Embedded Figures Test, and ambiguity
tolerance was tested by the MAT-50, a Likert-scale measure. Proficiency was
assessed by the TOEFL, grammar, dictation, cloze, and oral tests. Results
showed that field independence and ambiguity tolerance explained a significant
amount of variance in posttest proficiency measures. The authors concluded that
good language learners may be, among other things, field independent and
tolerant of ambiguity.

3. B. S. K. Morris & L. J. Gerstman (1986. Age contrasts in the learning of
language-relevant materials: some challenges to critical period hypotheses.
Language Learning, 36, 3, 311-352.) studied 182 children in three public school
grades (4th, 7-8th, and 11th) who were given a lesson in Hawaiian (a language
none of the Ss had any contact with) and were then immediately given a test to
measure semantic and syntactic content retention. Another test was given one
week later. (Results from an ANOVA showed grade 4 students had the poorest
retention, and grade 7-8 students did better than the 11th-graders on some tasks.)
Among the demographic variables studied (sex, reading level, attitudes,
socioeconomic status), reading level proved to be the best predictor of immediate
retention. The best predictor of retention after one week was the immediate
retention test score. The authors concluded that the capacity to learn a language
cannot be solely related to age.

4. C. Chapelle & J. Jamieson (1986. Computer-assisted language learning as a
predictor of success in acquiring ESL. TESOL Quarterly, 20, 1, 27-46.) reported
on 20 Arabic and 28 Spanish speakers who used lessons from several series of
CALL lessons outside ESL classes during the course of a semester. Data
gathered on the Ss included time spent using CALL materials, attitudes towards
CALL lessons, field independence, tolerance of ambiguity, motivation intensity,
English class anxiety, TOEFL scores, and scores on an oral test of communicative
competence. The study showed that motivational intensity and field
independence accounted for a significant amount of the variance in time spent using
CALL and attitude towards CALL. One would expect motivated students to
have positive attitudes towards, and spend time on, many academic activities,
including CALL. But field independence accounted for variance in time and
attitude even after motivation had been entered into the regression.

5. M. Zeidner (1987. A comparison of ethnic, sex, and age bias in the predictive
validity of English language aptitude tests. Language Testing, 4, 1, 55-71.)
presents a series of simple regression analyses aimed at discovering possible bias in
an English language aptitude test which is part of a college entrance exam in
Israel. "Bias" in this study was operationally defined as error in the predictive
validity of the English test for three population subgroups. First-year GPA was
predicted by the English test for three subgroups of 824 Ss: males and females,
Western and Oriental students, and Ss in four age groups: 18-21, 22-25, 26-29,
and 30+. The analyses show that first-year GPAs are overpredicted for males
and Orientals while they are underpredicted for females and Western students.
Test scores predicted GPA scores less well for the oldest age group (30+) than
for the other age categories. However, since all errors in prediction were modest,
the author concludes that the results do not support contentions by some that
universities should use different criteria for selecting members of subgroups in the
population of applicants.


Chapter 17

Other Statistical Tests Used in 
Applied Linguistics 


• Principal component analysis and factor analysis
    Interpreting components or factors
• Multidimensional scaling
• Path analysis
• Loglinear and categorical modeling
    Categorical modeling and causal analysis
    Categorical modeling with repeated-measures

As you read through the research literature in applied linguistics, you may find 
statistical procedures which have not been presented in this book. We have tried 
to include those that are most used in the field. This means that some procedures 
not often used or those that are alternatives to procedures already presented will 
not be discussed here. 

There are, however, other very useful procedures that ought to be included in any 
book for researchers or consumers of research in applied linguistics. These are 
important because they hold promise of helping us better understand language 
learning in all its complexity. 

In the first section of this manual, we asked that you narrow the scope of your 
research so that it would be feasible. Often this means looking at the relation 
between just two variables. When we do this, we inevitably think of a third
variable that may play a role or may influence the relation between the original two
variables. That is, our expertise in the field of language learning tells us that in
the "real world" the picture is much more complex, that we must add more pieces
to the puzzle if we want to capture the essence of language learning. As we add
more variables, the research becomes increasingly complex, sometimes so
complex that we need to reduce large numbers of variables in some meaningful way.
Our picture of the "real world" may also suggest that those variables discovered 
to be important in language learning should be arranged in a particular way. 
And so, models are proposed (for some reason, these untested models are
sometimes called theories) for the learning process. The statistical procedures included
in this chapter are important because they give us a way of discovering factors 
that underlie language proficiency (and, hopefully, language learning) and ways 
of testing the relationships among them. 

By and large, these procedures require relatively large numbers of Ss or
observations, require some mathematical training to understand, and can only
realistically be carried out using computer programs. While most of our research
involves small n sizes, there are times when the data are sufficient for appropriate 
use of these procedures. Our goal will be to give you a brief introduction to each 
without much attention to the statistical details. These introductions are heuristic 
in nature. Since they cover complex statistical procedures, they will, of necessity, 
be misleading if taken at face value. We therefore strongly urge that you read the 
listed references and talk with a statistical consultant before selecting any of the 
procedures presented in this chapter (and that you ask for assistance in selecting 
the best computer programs). 

The first group of procedures (principal component and factor analysis,
multidimensional scaling, and path analysis) all relate to correlation or to
variance/covariance. With the exception of interitem correlations of language
tests, the data are continuous (interval) in nature. The second group, loglinear
procedures, relates to such distributions as those typically presented in Chi-square
analyses. The data for this second group consist of frequencies of nominal
(categorical) variables.


Principal Component Analysis and Factor Analysis 

Principal component analysis and factor analysis are techniques used to
determine whether it is possible to reduce a large number of variables to one or more
values that will still let us reproduce the information found in the original
variables. These new values are called components or factors. They do not exist on
the surface of the observed data but they can be captured by these analytic
techniques.

The task undertaken in each analysis is that of extracting what is common among
various tests and thereby reducing the number of tests (or variables). For
example, you might administer several general achievement tests to large numbers of
Ss. The tests purport to be general achievement measures, but you wonder
whether they aren't, perhaps, testing several different things. If you run a
correlation of the test results, you might find that some correlate highly with each
other while others may not be highly correlated. Certainly if there are subtests, 
you would expect that the tests which measure reading must be measuring 
something different than those which measure, say, spelling abilities or, perhaps, 
functional communication skills. If a large number of tests are available, we 
would like to know (statistically) whether there are components that are shared 
in common by the tests and whether we can capture them in a meaningful way. 
And we would like to know just how many different components actually exist 
that underlie all the tests (or variables). 

Consider another example where you have data from many different tests for a
group of Ss. Say that you surveyed the literature and decided that there are
many independent variables that might influence language achievement (as
measured by a proficiency test). You had planned, perhaps, to do a multiple
regression to find out which of these many variables can best predict language
achievement. Imagine you have scores on field dependence/independence,
tolerance of ambiguity, introversion-extroversion, short-term memory,
extrinsic-intrinsic motivation, integrative-instrumental motivation, years of
language study, a cultural attitude measure, an acculturation measure, a role values
measure, a cultural contact measure, a metalinguistic awareness test, a
self-concept test, the Peabody Picture Vocabulary Test, a self-report on degree of
bilingualism, an impulsivity test, and so forth. Each of these might or might not
be an important predictor of language achievement. There are two potential
problems to be faced at once. First, we know that the correlation of these tests 
is sensitive to the relative reliability of each test. So, prior to running the
procedure, you will want to check test reliability and, where appropriate, correct the
correlation values (correction for attenuation). A second potential problem is
sometimes called multicollinearity. If many of the measures are highly
intercorrelated (multicollinear), it may be impossible to carry out your plan. Because
there are so many variables to look at, you might begin to wonder if some of these
measures might be collapsed into just a few factors, perhaps those that measure
acculturation, those that seem to be personal need-achievement tests, and those
which are related to memory. Aside from logic, there is no way to collapse
variables into a smaller number of variables in a meaningful way unless we use an
analysis that searches for these factors. The analysis will use statistical
procedures (the researcher later supplies logic) to discover these shared, underlying
factors.


Principal Component Analysis (PCA) 

Principal component analysis is used to discover components that underlie
performance on a group of variables. It is best utilized as a precursor to factor
analysis. Why this is so will be explained as we discuss these two procedures.

Depending on the software program used, PCA starts with either a correlation 
matrix of all the tests or with a covariance matrix of the tests. The procedure 
searches through the data of all the tests to find a first principal component. It
will produce a value for this first component. This first component will explain
as much of the total variability in the original data as possible. The value 
produced is called an eigenvalue, a weight of sorts, for the first component. 

The procedure then searches through all the tests for a second component that is 
not correlated with the first component. This component, again, should account 
for as much of the total remaining variance in the data as possible. And, again, 
this will be isolated and presented as a value for the second principal component. 

The procedure will then search for a third component that is uncorrelated with 
the first two components and present a coefficient for this component. The 
process of extracting components from the matrix (whether it is a correlation or
a covariance matrix) will continue until as many components as there are tests (or
variables) have been extracted. 

Once the components are extracted, the researcher looks to see how much of the 
variance has been accounted for by the first component, the second component, 
and so forth. To do this the researcher compares the relative magnitude of each
eigenvalue to each successive eigenvalue. If one or two or three components
account for most of the information in all the tests, then we can talk about the tests
as measuring these few components (rather than the many original variables). 

If the new components capture most of the information in the original data, then 
we have succeeded in reducing a large number of variables to a few, hopefully 
meaningful, components. These components indicate the possible number of 
factors to extract in a subsequent factor analysis. 
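
The extraction sequence just described (find the component that explains the most variance, remove it, then search again) can be sketched in a few lines. The example below is a hedged illustration only: it uses power iteration with deflation, one simple textbook way of obtaining eigenvalues, and statistical packages rely on more robust routines. The 3 x 3 correlation matrix and the function names are hypothetical.

```python
import math


def largest_component(matrix, iterations=200):
    """Power iteration on a symmetric matrix.

    Returns (eigenvalue, unit_vector) for the component that accounts
    for the most variance remaining in the matrix.
    """
    n = len(matrix)
    # Arbitrary, deliberately asymmetric starting vector
    vector = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(p * p for p in product))
        vector = [p / length for p in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return eigenvalue, vector


def extract_eigenvalues(matrix):
    """Extract as many components as there are variables, largest first."""
    n = len(matrix)
    remaining = [row[:] for row in matrix]
    eigenvalues = []
    for _ in range(n):
        value, vector = largest_component(remaining)
        eigenvalues.append(value)
        # Deflation: subtract the variance this component has explained,
        # so the next search finds a component uncorrelated with it
        for i in range(n):
            for j in range(n):
                remaining[i][j] -= value * vector[i] * vector[j]
    return eigenvalues


# Hypothetical correlation matrix for three tests (invented values)
correlations = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]

eigenvalues = extract_eigenvalues(correlations)
# For a correlation matrix the total variance equals the number of tests
proportions = [value / len(correlations) for value in eigenvalues]
for number, (value, share) in enumerate(zip(eigenvalues, proportions), start=1):
    print(f"Component {number}: eigenvalue = {value:.3f}, share = {share:.1%}")
```

With these invented correlations the first eigenvalue dominates the other two, which is the pattern a researcher looks for when deciding that a few components capture most of the information.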


Factor Analysis

Following upon PCA, the goal of factor analysis is to discover factors that 
underlie a series of items or tests that measure many variables. It does this by
decomposing the score variances to isolate common variance (common because 
it appears in more than one test) and looking for factors there. This is one of the 
basic differences between factor analysis and PCA. Factor analysis looks only 
at common variance, while PCA looks at all variance present in the data. 

Like PCA, factor analysis (there are several different types) usually begins with
a correlation matrix, the correlations of all the tests. In extracting factors, the 
procedure examines the variance in the tests in terms of three components: 

1. Common variance: the variance that is shared because there is some
underlying factor or factors that appear in more than one test.

2. Specific variance: the variance that is specific to a test, not shared with
others.

3. Error variance: ordinary sampling error.

The common variance may divide into several factors or only one. In the earlier
example, we had many variables (field dependence, and so forth) that might
relate to language achievement. The goal is to reduce all these to a small number
of factors. These factors will be found in the common variance shared by the
tests. In addition, the procedure will identify unique variance. This is specific
variance which is not shared across tests plus ordinary error variance. The total
variance, then, will be:

$$V_{total} = V_{common} + V_{specific} + V_{error}$$

Total variance equals common variance + specific variance + error variance. 
The specific and error variance together are called unique variance. 

Common variance will be made up of a combination of one or more common
underlying factors shared by all the tests. Therefore,

$$V_{total} = V_A + \dots + V_n + V_{specific} + V_{error}$$

If the researcher has already used PCA, the range of possible factors would
already be known. Factor analysis can be used to confirm that information. If
PCA and factor analysis are doing the same thing, how do they differ? Principal
component analysis starts with a correlation matrix. Think for a minute what
such a matrix looks like. Across the diagonal are a series of 1.00s. This is the
correlation of each variable with itself, right? These are the diagonals used in
principal component analysis. What happens in factor analysis is that these
values are replaced by the communalities, values for the common variance shared
by the tests (or variables). Specific and error variance are removed.
Communalities take the place of the 1.00s in the principal diagonal.

The effect of this replacement (communalities for 1.00s in the diagonal) is that the 
matrix algebra of the factor analysis will only work with common variance. 
That's important: factor analysis seeks the parsimony of the best set of common
factors. It does not also try to explain specific and error variance (as does PCA). 
This feature is central to the difference between PCA and factor analysis and is 
the primary motivation to use factor analysis in place of, or following, principal 
component analysis. 
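
The replacement of 1.00s by communalities is easy to picture with a small matrix. The sketch below is illustrative only: the correlations and communality values are invented, the function name is hypothetical, and in practice the communalities must themselves be estimated by the factor analysis program.

```python
def reduced_correlation_matrix(correlations, communalities):
    """Copy a correlation matrix and put communalities on its diagonal."""
    reduced = [row[:] for row in correlations]
    for index, communality in enumerate(communalities):
        reduced[index][index] = communality
    return reduced


def trace(matrix):
    """Sum of the diagonal: the total variance the analysis works with."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Hypothetical correlations among three tests (invented values)
correlations = [
    [1.00, 0.62, 0.55],
    [0.62, 1.00, 0.48],
    [0.55, 0.48, 1.00],
]
# Hypothetical communality estimates, one per test (invented values)
communalities = [0.70, 0.60, 0.50]

reduced = reduced_correlation_matrix(correlations, communalities)

# PCA works with all the variance (1.00 per test);
# factor analysis works only with the common variance.
print("Variance analyzed by PCA:            ", trace(correlations))
print("Variance analyzed by factor analysis:", round(trace(reduced), 2))
```

The off-diagonal correlations are untouched; only the amount of variance each test brings to the analysis changes.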

In factor analysis, the matrix is next rotated to achieve "maximum parsimony."
The type of rotation used is "oblique" if the factors are presupposed to be
correlated and it is "orthogonal" if the factors are thought to be uncorrelated. (In
language research, it is rather difficult to think of factors that would not be
correlated, so the type of rotation is more frequently oblique.) It is the result of this
final step, the rotated factor analysis, which is presented in research reports.
Printouts and journal tables usually list the number of factors found and
additional information on common variance and specific variance. The common
variances represent the correlation coefficient of each test score with the
underlying factor(s). The portion of the total variance contributed to each factor is
referred to as factor loading. Let's look at one such table and work through an
interpretation here.

Suppose that we had three tests—a general proficiency test for French, a test of 
spoken French, and a French reading test. Imagine that since there are three 
tests, you requested at least three factors. Depending on the computer program 
you use, at least some of the following information will be supplied. 


Test         F1     F2     F3     Spec    Err    Comm    Tot

Profic.      .16    .36    .25    .09     .14    .77     1.00
Spkn. Fr.    .49    .16    .00    .25     .10    .65     1.00
Rdg.         .36    .00    .16    .36     .12    .52     1.00


Look at the top row. The general proficiency test shows factor loadings for each 
of the three factors (.16, .36, .25). These three factors make up the common 
variance of .77 for the achievement test. The unique variance is .23 (.09 specific 
+ .14 error). That accounts for the total variance of the test. Now if you look 
just at the "specific variance" column, you will see that each test shows some 
specific variance. Of the three, the reading test is the one which seems to have 
the most specific variance (variance not linked to the three factors). The
achievement test has the least amount of specific variance. It has the highest
common (shared) variance of the three tests. 
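
The arithmetic behind this reading of the table can be verified directly. The short sketch below simply re-adds the values from the French example, treating the three factor columns as portions of variance in the way the text does; the dictionary layout and variable names are our own.

```python
# Values copied from the French example table
table = {
    "Profic.":   {"F1": 0.16, "F2": 0.36, "F3": 0.25, "Spec": 0.09, "Err": 0.14},
    "Spkn. Fr.": {"F1": 0.49, "F2": 0.16, "F3": 0.00, "Spec": 0.25, "Err": 0.10},
    "Rdg.":      {"F1": 0.36, "F2": 0.00, "F3": 0.16, "Spec": 0.36, "Err": 0.12},
}

summary = {}
for test, row in table.items():
    common = row["F1"] + row["F2"] + row["F3"]   # variance tied to the factors
    unique = row["Spec"] + row["Err"]            # specific + error variance
    summary[test] = {"common": common, "unique": unique, "total": common + unique}
    print(f"{test:10s} common = {common:.2f}  unique = {unique:.2f}  "
          f"total = {common + unique:.2f}")

# Which test has the most and the least specific variance?
most_specific = max(table, key=lambda test: table[test]["Spec"])
least_specific = min(table, key=lambda test: table[test]["Spec"])
print("Most specific variance: ", most_specific)
print("Least specific variance:", least_specific)
```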

A loading of .30 or above is considered to be a substantial link of a factor and 
test. Let's look, then, at the loadings for factor 1. The test of spoken French and 
the reading test contribute most heavily to this factor; the general proficiency test 
loading is much less. Immediately we start to wonder what common ground there 
is between speaking and reading which is less common in general achievement. 
No answer springs ready-made to mind. Turning to factor 2, we see that the 
reading test contributes nothing to factor 2, and the speaking test adds little to 
the factor. Only the achievement test seems to contribute much to this factor. 
Again, what is there that an achievement test might measure which would not 
also appear in speaking and reading tests or, conversely, what isn't in speaking 
and reading but is present in achievement tests? Factor 3 shows very little. The 
speaking test contributes nothing to the factor and the loadings from the other 
two tests are slight. If we look at the factors themselves, it should be clear that 
factor 3 has low factor loadings from the three tests; it could be safely dropped 
from the analysis. Eigenvalues, supplied in the printout, should confirm this. 
Actually, a one-factor solution seems likely. We are, however, still left with the 
problem of identifying at least factor 1 and, depending on information from the
eigenvalues in this analysis, perhaps, factor 2. 
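
The .30 convention used in this walk-through amounts to a simple filter. As a hedged sketch (the cutoff is a convention, not a statistical test, and the names below are our own), the French example can be screened like this:

```python
CUTOFF = 0.30  # conventional threshold for a substantial loading

# Factor columns copied from the French example table
loadings = {
    "Profic.":   {"F1": 0.16, "F2": 0.36, "F3": 0.25},
    "Spkn. Fr.": {"F1": 0.49, "F2": 0.16, "F3": 0.00},
    "Rdg.":      {"F1": 0.36, "F2": 0.00, "F3": 0.16},
}

# For each factor, list the tests whose loading reaches the cutoff
substantial = {}
for factor in ("F1", "F2", "F3"):
    substantial[factor] = [
        test for test, row in loadings.items() if row[factor] >= CUTOFF
    ]

for factor, tests in substantial.items():
    label = ", ".join(tests) if tests else "(no test reaches the cutoff)"
    print(f"{factor}: {label}")
```

The filter only identifies which tests group together; deciding what the grouping means remains the researcher's job.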


Interpreting Components or Factors 

As you might guess, the hard part is that after detecting underlying factors from 
factor analysis, we need to find an appropriate label for each. In the earlier
example, perhaps cultural contact, acculturation, integrative motivation, and
cultural role values all load on one factor which can conveniently be labeled 
something like cultural factor. 

It's possible, however, that introversion/extroversion, the Peabody vocabulary 
test, and short-term memory would load on one factor. Imagine nothing else 
loads on this factor. What label could we attach to such a factor? Or, perhaps, 
self-concept is included in a factor that has high loadings for short-term memory, 
field dependence/independence, and metalinguistic awareness (and low loadings
from all the other test instruments). Again, we would be better able to find a
cover term for this factor if the contribution of self-concept were not there. We
always hope that we will be able to interpret the factors in a reasonable way.
Since it is often difficult to name the underlying traits accurately, researchers
often leave the categories as "factor A" and "factor B" and allow the consumer of
the research to determine the meaning of the factor.

Let's turn to another example, and see what real loadings sometimes look like and 
the difficulty in interpretation researchers face. Jafarpur (1987) demonstrated, 
in a variety of ways, the reliability and validity of the short context technique 
(SCT) for testing reading. The technique tests reading using passages of one, two, 
or three sentences in length with one or two questions on overall meaning of each 
passage. The study includes the correlations of the test with various subtests of 
the Michigan Test of English Proficiency (Upshur, Corrigan, Dobson, Spaan, and
Strowe, 1975) and the ALIGU test (Harris, 1965) as showing concurrent validity.
Finally, he produces a table showing a factor analysis of the SCT (short context
technique test), the grammar, vocabulary, and reading subtests of the Michigan
battery, and a cloze test for sample group 1 (91 Ss in the M.S./Education
program at Shiraz University).


Varimax Rotated Factor Matrix-Group 1


Test          A        B        C        h²

MTELP-Gr      .43      .41      .65      .78
MTELP-V       .13      .63      .21      .46
MTELP-Rd      .69      .14      .37      .63
SCT           .63      .58      .29      .81
Cloze 1       .40      .30      .56      .57

V             2.82     .29      .14
%             87.00    8.90     4.20


h² is the common factor variance. If you look at the bottom of the table, you
will see that the first factor accounts for 87% of the total variance in the tests, the
second for 8.9% and the third for 4.2%. Obviously the first factor is the major
component underlying the tests. You can tell this by comparing the successive
eigenvalues in the row labeled V (2.82, .29, .14). In this case, the researcher
might opt for a one-factor solution and talk about the unidimensionality of these 
tests. However, let's look at each factor in turn to see what information they can 
give us. If you look at the numbers under factor A, you will see that the grammar 
subtest contributes a loading of .43 to the factor, vocabulary contributes .13, the 
reading test of the Michigan battery is .69, the SCT is .63, and Cloze 1 is .40. 
By convention a loading of .30 or above is considered to signal that the test does 
load on the factor. So the two reading tests, the cloze, and the grammar test 
group together on this factor while the loading from vocabulary is low. Since the 
SCT is a reading test and loads on factor A along with the Michigan reading and 
grammar tests, and a cloze test (which might also measure, in part, whatever 
reading tests measure), calling this a reading factor seems to make some sense. 
Cloze is something of a predict-as-you-read process. Still, it doesn't really make 
sense that vocabulary should not contribute as much to the factor as grammar 
(since reading research has shown vocabulary, particularly, and grammar, less so, 
are related to reading proficiency). If we look at the SCT row, we see that the test 
also contributes to factor B. This second factor links vocabulary, the SCT
reading test and the Michigan grammar test. This would not appear strange if the
Michigan reading test were also included. Then perhaps a more vocabulary + 
reading connection could be made. As it is, the factor is difficult to interpret. 
Perhaps grammar and vocabulary are more important with short passages such 
as those used in SCT than they are for the longer passages in the Michigan 
reading test. SCT, in part, measures something that grammar and vocabulary 
also measure. 

SCT does not contribute to the third factor. Grammar and the cloze test do. 
The factor, however, accounts for very little of the variance in the data (.14) and 
could, possibly, be due to normal sampling error. From the eigenvalues, it
appears that the tests are unidimensional. That is, the common variance need not
be broken down into three factors; one will do.
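
Two of the quantities in the table can be roughly reproduced from the others. With an orthogonal rotation such as varimax, the communality (h²) of a test is the sum of its squared loadings, and each value in the V row can be expressed as a share of the sum of that row. The sketch below is an informal check, not part of Jafarpur's analysis; because the printed loadings are rounded, the results should be expected to come close to the printed h² and % values rather than match them exactly.

```python
# Loadings and h-squared values copied from the rotated factor matrix
loadings = {
    "MTELP-Gr": (0.43, 0.41, 0.65),
    "MTELP-V":  (0.13, 0.63, 0.21),
    "MTELP-Rd": (0.69, 0.14, 0.37),
    "SCT":      (0.63, 0.58, 0.29),
    "Cloze 1":  (0.40, 0.30, 0.56),
}
reported_h2 = {
    "MTELP-Gr": 0.78, "MTELP-V": 0.46, "MTELP-Rd": 0.63,
    "SCT": 0.81, "Cloze 1": 0.57,
}
v_row = (2.82, 0.29, 0.14)  # eigenvalues from the row labeled V

# Communality = sum of squared loadings across the three factors
computed_h2 = {
    test: sum(loading ** 2 for loading in row)
    for test, row in loadings.items()
}
for test, value in computed_h2.items():
    print(f"{test:9s} computed h2 = {value:.3f}   reported = {reported_h2[test]:.2f}")

# Share of the variance represented by each value in the V row
percentages = [100 * value / sum(v_row) for value in v_row]
print("Percent by factor:", [round(p, 1) for p in percentages])
```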

The above discussion should not detract from the findings of this study. In fact, 
if a unidimensional solution is accepted, SCT fits well: it shares much of its
variance with these other tests. The study includes a variety of procedures and
displays that give confidence regarding the use of SCT as an alternative approach
for testing reading. 

The discussion above results from a procedure that we have urged you to adopt 
in reviewing research articles. As always, when you read a research article, begin 
with the abstract, then turn to the tables. Try to interpret the factors by looking 
at the factor loadings. Once you are satisfied with your own labels for the factors, 
read the study, comparing your perceptions with those of the author. Test your
ability to identify and label factors in the following practice. 

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 17.1 

1. Svanes (1987) administered a questionnaire to 167 foreign students enrolled 
in special Norwegian classes in order to study their motivation for studying 
Norwegian. A factor analysis of questionnaire responses was used to reduce the 
20 items. The table lists three factors obtained from the factor analysis with 
varimax rotation. 


Factor Loadings for 20 Motivation Variables


Motivation Variable                            Factor 1   Factor 2   Factor 3

Reasons for Coming to Norway

Seeing Norway, scenery                            .73       -.140      -.13
Getting to know Norwegians                        .80       -.070      -.06
Getting a degree                                 -.06        .800      -.21
Finding out how people live                       .81       -.200       .04
Study in Norway                                   .07        .740      -.27
Get training in field                             .12        .470      -.04
Chance to live in another country                 .70       -.190       .03
Find how students live                            .58        .160       .04
Joining family members                            .03       -.260       .26
Have new experiences                              .67        .020      -.14
Meet different kinds of people                    .70        .090      -.05
Fleeing from my country                          -.28        .060       .45

Reasons for Studying Norwegian

To be able to study at the university             .04        .068       .16
Interest in Norwegian culture                     .62       -.051       .33
Interest in Norwegian language                    .38        .004       .26
To get a job in Norway                           -.07       -.140       .67
To begin to think and behave as Norwegians        .28        .150       .38
To get a good job in home country                -.05        .570       .09
To establish better relations with Norwegians     .46        .010       .44
To get an education in order to serve
  my country                                     -.17        .790       .17


First circle all loadings greater than .30. Within each factor, compare the
descriptions of variables above .30 with those below .30. Then decide on a label for
each. Svanes labeled the first two factors as integrative and instrumental
motivation, respectively. Do you agree with this classification? Why (not)?


Can you think of a label for the third factor? 


Look at each of the three factors. Since the heaviest loadings are on the first 
factor, on what grounds would you argue for retaining a second or third factor? 
What additional information would you need? 


(Authors usually do include the h² and eigenvalues in tables. Sometimes such
details are deleted to make tables simpler for the general reader, resulting in
tables that are difficult for researchers to interpret accurately. All authors face this
dilemma: how to give all important information and not, at the same time,
overload the reader with detail.)


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Principal component analysis, followed by factor analysis, are techniques that
allow us to reduce a large number of tests (or test items) to a smaller number of
variables (called factors or components) that will retain as much of the original
information in the data as possible. The techniques are quite common in studies
that include information from many different tests, or many items of a single
test. Since the techniques begin with correlation or covariance matrices, they
are only appropriate when the assumptions that underlie these procedures have
been met. In some cases these include the notion of normal distribution. To
obtain a normal distribution, a rule of thumb is that 30 observations per variable
are needed. For this reason, factor and principal component analysis are really
only appropriate for large-scale studies (determined by the number of variables
but usually over 200 Ss). The problem in our field, of course, is that we are
seldom able to obtain so much data. For example, to use an n of 30 per variable
or test in the Svanes example would require 30 × 20 = 600 Ss. Until some sort
of research consortium is formed in applied linguistics, it is unlikely that any
single researcher would have access to such numbers. When these procedures are
used with smaller sample sizes, the interpretation of the findings must be
conservative, and the reader should be warned that the procedure has only been
carried out as an exploratory measure and that the results of the exploration need
to be replicated.
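
The rule of thumb above is simple arithmetic, sketched here in Python. The function names are hypothetical and the default of 30 observations per variable is the text's rule of thumb, not a universal standard.

```python
def required_sample_size(n_variables, per_variable=30):
    """Number of Ss suggested by the 'n per variable' rule of thumb."""
    return n_variables * per_variable


def meets_rule_of_thumb(n_subjects, n_variables, per_variable=30):
    """True if the available sample reaches the suggested size."""
    return n_subjects >= required_sample_size(n_variables, per_variable)


# The Svanes example: 20 variables at 30 Ss each.
print(required_sample_size(20))
# A study with 150 Ss and 20 variables falls short of the guideline.
print(meets_rule_of_thumb(150, 20))
```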

For an overview of principal component analysis and factor analysis, see
Kerlinger (1986) and Kim and Mueller (1978).


Multidimensional Scaling 

Multidimensional scaling is a data reduction/clustering procedure similar to
factor analysis. In addition to interval data, it can take rank-order (ordinal
scale) data as its input, data such as respondents' ratings of the
difference/distance between stimuli. For example, Ss might be asked to respond to
a set of questions that measure their attitudes towards particular aspects of an
ESL program. Their responses would be on a scale of, say, 7 points. It is possible
to view the correlations of the items as "similarities," where high correlations
represent greater similarity, and the lower correlations show "dissimilarity." The
procedure will convert these similarities and dissimilarities to distances in
space. Tests (used loosely here to mean tests or items or Ss or other
observations) that show strong similarity will cluster together and those that
are dissimilar will cluster together in space at an appropriate distance from the
first cluster. This gives us one dimension in space. Then the procedure goes back
to the same items or tests and repeats the process for a second dimension and
plots it in space. A third dimension will give a three-dimensional grouping in
space.

To do all this, multidimensional scaling internally converts scales, scores, or
correlations to difference/distance values. It models the distances as follows: it
plots all the data points so that the plotted distance between each pair of points
most closely approximates the observed distance, a value derived from the
correlations.
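
As a rough illustration of the conversion step, correlations can be turned into dissimilarities before any plotting is done. The sketch below assumes the simple rule d = 1 - r; this is one common choice, not necessarily the one a given MDS program uses, and the three-item correlation matrix is invented for the example.

```python
def correlations_to_dissimilarities(corr):
    """Convert a square correlation matrix to dissimilarities using d = 1 - r.

    High correlations (similar items) become small distances; low or negative
    correlations become large distances. The diagonal is set to zero.
    """
    n = len(corr)
    return [
        [0.0 if i == j else 1.0 - corr[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical correlations among three attitude items.
corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

for row in correlations_to_dissimilarities(corr):
    print([round(value, 2) for value in row])
```

Items 1 and 2 end up close together, while item 3 sits far from both, which is the pattern the plotted solution then tries to reproduce.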

The researcher determines the number of dimensions to extract by allowing the
program to run multidimensional scaling solutions with increasing numbers of
dimensions and then checking to see which best fits the original data. The
computer program gives a "stress" value. The lower the stress value, the better
the match to the data. In recent work, researchers have been able to use special
plots (something like the scree plots generated for factor analysis) to judge
accurately the true number of dimensions in the data. Aside from the area of test
analysis, however, this new plotting possibility has not yet been used.
Researchers, instead, have used the MDS solutions, the stress values, and common
sense to determine the number of dimensions.
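
To make the idea of "stress" concrete, here is a minimal sketch of Kruskal's stress-1 formula. It assumes the observed dissimilarities are compared directly with the plotted distances; real nonmetric MDS programs first apply a monotone transformation to the data. The function names and the toy data are illustrative only.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def stress1(points, observed):
    """Kruskal's stress-1: sqrt(sum((d - delta)^2) / sum(d^2)).

    d     = distance between two points in the plotted configuration
    delta = observed dissimilarity for the same pair
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = euclidean(points[i], points[j])
        numerator += (d - observed[i][j]) ** 2
        denominator += d ** 2
    return math.sqrt(numerator / denominator)


# Three items that are all equally dissimilar (an equilateral triangle).
observed = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

one_dim = [(0.0,), (0.5,), (1.0,)]                              # forced onto a line
two_dim = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]     # a triangle

print("1 dimension :", round(stress1(one_dim, observed), 3))
print("2 dimensions:", round(stress1(two_dim, observed), 3))
```

The one-dimensional layout cannot keep all three items equally far apart, so its stress is well above zero; adding a second dimension lets the stress fall to essentially zero.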

One of the especially appealing features of MDS is that it can analyze noninterval
ratings, e.g., sorts. For example, Kellerman (1978), in a study of lexical transfer,
asked Ss to sort cards on which were written sentences containing the word break
(actually breken). As Ss sort cards into piles, they are, in effect, determining the
semantic similarities of meanings by the number of piles they use in sorting. The
MDS procedure can use ratings or correlations as input and is much more
sophisticated in plotting the dimensions since, figuratively, the cards are being
re-sorted in an iterative fashion to find the best number of dimensions. When MDS
was employed for this experiment, the procedure showed a reasonable
two-dimensional solution. The two major dimensions appeared to relate to a
core/noncore dimension (since relabeled as unmarked/marked) and an
imageable/nonimageable (or concrete/abstract) dimension.


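[Figure: two-dimensional MDS plot of the break sentences, numbered 1 to 17.
Vertical axis: unmarked (top) to marked (bottom). Horizontal axis: concrete
(left) to abstract (right).]
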

The numbers set in space show break in sentences with particular collocations
(1 = waves, 2 = light rays, 3 = leg, 4 = cup, 5 = man, 6 = heart, 7 = word,
8 = oath, 9 = ceasefire, 10 = strike, 11 = law, 12 = ice, 13 = game, 14 = record,
15 = voice, 16 = fall, 17 = resistance). Since the horizontal axis is
concrete/abstract, those numbers toward the left side of the line are the most
concrete and those toward the right, the most abstract. Those numbers close to
the top of the vertical line are most unmarked and those to the bottom are more
marked meanings. This makes sense since 3, break a leg, is concrete and imageable
compared with 11, breaking the law, and 4, break a cup, is more corelike or
unmarked in meaning than 13, using a game to break up an afternoon. Interestingly,
this experiment also showed that the marked/unmarked dimension was important in
determining students' willingness to translate breken as break while the
abstract/concrete dimension had little effect.

As we have said, one particular advantage of multidimensional scaling is that it
can take a variety of statistics as input. For example, while factor analysis is not
appropriate with Spearman rank coefficients, MDS is appropriate because it can
effectively analyze Spearman coefficients. This advantage of MDS over factor
analysis is particularly relevant in our field when the data are not clearly interval
in nature, and the researcher therefore elects to use the Spearman coefficient.
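
For readers who want to see what a Spearman coefficient involves, the sketch below computes it as a Pearson correlation on ranks, with tied values sharing an average rank. It uses only the Python standard library; the function names and the two short rating lists are invented for the example.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank-order correlation: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical 7-point ratings from six Ss on two items.
item_a = [1, 3, 4, 4, 6, 7]
item_b = [2, 2, 5, 4, 6, 7]
print(round(spearman(item_a, item_b), 3))
```

A matrix of such coefficients, one for each pair of items, is the kind of input the text describes for MDS.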

Usually, interpretable MDS solutions do not exceed four dimensions. Since the
analysis is spatial, comments about a five-dimensional hyperspace would be an
interpretation nightmare. When interpretable, MDS produces not loadings, but
coordinates in Cartesian space (as shown in the above plots). The researcher
examines these resulting plots for neighboring or aligning groups of variables.
Neighboring clusters or aligning groups are then interpreted in the same manner
as factors in factor analysis; the researcher draws conclusions based on those
clusterings, and labels the dimensions.

In applied linguistics, MDS has also been used to analyze large-scale data from
language tests. For examples of MDS with language measurement data, see
Davidson (1988), Hozayin (1987), and Oltman and Stricker (1988).


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Practice 17.2 

1. Read the summary of the Ijaz study in the activities section (page 523). The
data of the semantic relatedness test were analyzed using MDS. The following
figures show the three-dimension (contact, vertical, space) solution for the
semantic space of on.

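[Figures: MDS plots of the semantic space of on, with axes labeled
VERTICALITY (II) and MOVEMENT (III) and the subject groups plotted as labeled
points. Kruskal's Stress = .027; Squared Correlation = .999. The dimensions are
indicated between brackets.]
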


The first dimension, contact, was the most "salient"; the second most "salient" was
vertical; the third was motion. The letters placed on the dimensions refer to the
six groups of Ss: EA (n = 87) and EB (n = 33) are native speakers of English; the
B group have parents with a different native language. GI are German immigrants
to Canada (n = 45); GN are German high school teachers of English; UI are 50
Urdu speakers who immigrated to Canada; and MI are foreign students or
immigrants from other language groups. Look at the placement of each group on
each dimension. Do the similarities/differences in placements of the groups
make sense to you? Which puzzle you? Compare your figure reading skills with
those of others in your study group. Give your consensus on figure interpretation
below:_


2. In order to arrive at names for the dimensions, the author used three criteria:
(1) the nature of the most salient semantic dimension of each word as became
apparent in a pilot study; (2) the nature of pairs which were rated significantly
differently by the L2 and native speaker groups; and (3) the nature of
group-specific erroneous lexical uses on the second test involving the different
words. We know that naming factors and dimensions is not an easy task. This is one
of the few studies that gives criteria for labeling. Have one member of your study
group read the article and explain study details to you. Do you feel the criteria
are sensible guidelines for labeling? Why (not)?


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

The MDS procedure is used to determine the number of dimensions that exist in
the data. As is the case in factor analysis, a decision must be made by the
researcher as to which solution (in MDS, how many dimensions) to accept.
Recently, computer programs have been used to generate visual plots that allow us
to judge accurately the true number of dimensions. However, all MDS programs
generate a "stress coefficient" value which can be helpful. This value reveals how
far the solution differs from the data on which it is based. The lower the stress
coefficient, the closer the match. Needless to say, the determination of an
acceptable stress range is arguable, and this means that both substantive and
statistical support must be given to the solution (number of dimensions) accepted.

Finally, Davidson (1988) notes that MDS consistently underestimates the number
of factors from a factor analysis extraction. This is because geometrically the
factor analysis model includes a point of origin, whereas MDS is more of an
arbitrary free-space plotting technique. This is not necessarily a weakness; i.e.,
factor extraction is not necessarily better because it provides more detail (more
dimensions). There comes a point in all dimensional modeling where the
researcher confronts over-large factor models, models which are uninterpretable
under current theory. Davidson concludes that competing model extraction
methodologies, like MDS and factor analysis, merely provide alternatives to each
other; neither, nor either's metaphor of the world, is inherently correct. Mental
abilities, such as language proficiency, may be better served by a factor analytic
metaphor, by an MDS analogue, or by some as yet unused or unforeseen statistical
clustering technique.

For overviews of MDS, see Coxon (1982) and Kruskal and Wish (1978). 


Path Analysis 

The intent of the preceding techniques has been to discover components or factors
that underlie a group of tests or to discover how they group together to form
clusters. In a very real sense, then, they are exploratory procedures (although
they may also be used to confirm the existence of factors). Path analysis is
concerned with a very different goal. In this case, the researcher has identified
some particular goal and the variables that relate to that goal. The analysis is
meant to test whether, in fact, the variables not only predict performance on the
final goal but explain that performance. That is, path analysis allows us to test
causal claims.

You will remember from the discussion of regression that it is possible to
accurately predict performance on one variable from performance on another
variable if the two variables are highly correlated. One of the caveats of
correlation, however, is that no causal claims can be made about this
relationship. That is, if you find that TOEFL scores are highly correlated with
scores on your homegrown proficiency examination, you can say they are related.
If you use regression, you will be able to predict performance on one measure
given performance on the other. Though you can predict performance with
confidence, you cannot say that performance on your proficiency exam causes the
predicted performance on the TOEFL.

Path analysis is used to study the direct and indirect effects of a set of variables
the researcher believes cause differences on a final goal, performance on the
dependent variable. Because it looks for causal links among the variables, it is
thought of as a theory-testing procedure rather than a discovery technique to
generate theory. The researcher already has in mind which variables are
important (perhaps on the basis of previous experiments set up to see how much
variation in final outcome can be produced by varying certain independent
variables) and hopes that the procedure will confirm these beliefs or that the
procedure will allow for some of the variables to be dropped, "trimmed" from the
theory.

There are a number of general assumptions in path analysis. 

1. The variables are linear, additive, and causal. This means that the data cannot
be curvilinear. Additivity means that the variables cannot be interactive (i.e.,
they should be independent of each other). And finally, logic should argue
that there is a causal direction to the relationship.

2. The data are normally distributed and the variances are equal. You know that
one method of obtaining a normal distribution with equal variances is to
increase the sample size. This is not a small-sample technique.

3. The residuals are not correlated with variables preceding them in the model nor
among themselves. This implies that all relevant variables are included in the
system.

4. There is a one-way causal flow in the system. That is, the final goal
(dependent variable) cannot be seen as causal for the preceding independent
variables. If your model is nonrecursive (i.e., it includes reciprocal causation
or feedback loops), consult a statistician for help.

5. The data are interval, continuous data. That is, the procedure is not
appropriate for nominal, categorical variables.

On the basis of previous research or theoretical reasoning, a researcher sets up a
path diagram such as the following.



The connections shown in the figure imply that independent variables 1 and 2
may be related to each other. Each may also be related to variable 3. Variable
3 may be linked with the dependent variable 4. Variables 1 and 2 may also be
linked with the dependent variable without the intermediary variable 3. The
analysis will test the model to see if some of these "paths" can be eliminated.

Perhaps this example will be clearer if we fill in the diagram. Imagine that we
wish to explain language learning success. This is the dependent variable, and
we have operationally defined language learning success in terms of performance
on a respected language achievement test. The data are interval in nature. We
have decided to test the "theory" that national language policy, parental
aspirations, and the learner's own need achievement determine achievement. A
score for national policy has been arrived at by converting a group of five
7-point scales to a final "score" which we believe is interval in nature. This is
box 1, the first independent variable. The data on parental aspirations also
consist of a final score derived from a series of questionnaire scales. These data
fill box 2, the second independent variable. Ss' need achievement has been
measured using a need-achievement test, again score data, which is placed in box
3, the third independent variable. We have placed need achievement in this
position because we hypothesize that it is a mediating variable, mediating the
effect of the first two variables. (We don't believe that the Ss' need achievement
influences either national language policy or parental aspirations, but these
variables may influence need achievement.)



If this diagram is correct, we would expect that most of the effect of language
policy and parental aspiration would be mediated by the Ss' own need
achievement. Only a small portion of the effect of language policy and parental
aspiration would directly impact on language achievement. If so, the "paths"
between box 1 and box 4 and between box 2 and box 4 could be deleted, and the
theory revised. Perhaps even the path between box 1 and box 3 will be deleted,
showing that the effect of national language policy is mediated through parental
aspirations which, in turn, may influence Ss' need achievement and final language
achievement.

The "test" of this path diagram is accomplished by a procedure similar to
regression. Regression would regress the single dependent variable (language
achievement) on all the independent variables to see how many are really needed
to account for most of the variance. Path analysis regresses each variable on all
preceding variables. Thus, it traces direct and indirect effects (paths). The
effects may be correlated so that the following pathway is shown:

The regression path may show that some or all of the effect of one variable is
mediated by another:



Or it may show that the effect of each variable on the following variable is
independent.

While we talk about path analysis as theory testing, it is really most useful in that
it allows us to do theory trimming. For example, we might argue that the
important variables that cause differences in performance in second language
achievement are proficiency in the L1, age, school quality, self-concept, level of
aspiration, and acculturation. Assume that we have operationally defined each
of these factors in a reasonable way and that we have ordered them (on the basis
of previous research or our own intuitions) as follows:



We hope that the analysis will show that some of the connecting lines are strong
links and that others are weak enough that we can delete them. The fewer paths
left after this trimming, the simpler (and, by most conventions, the better) the
theory will be.
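
The regression step described earlier, in which each variable is regressed on all preceding variables, can be sketched from a correlation matrix alone. The code below is an illustration under stated assumptions: the four-variable correlation matrix is invented (it loosely follows the policy, parental aspiration, need achievement, and achievement example), the function names are hypothetical, and the path coefficients are taken to be standardized regression weights.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def path_coefficients(corr, target, predictors):
    """Standardized weights for regressing `target` on `predictors`."""
    a = [[corr[p][q] for q in predictors] for p in predictors]
    b = [corr[p][target] for p in predictors]
    return dict(zip(predictors, solve(a, b)))


# Hypothetical correlations. Index 0 = national policy, 1 = parental
# aspirations, 2 = need achievement, 3 = language achievement.
R = [
    [1.00, 0.30, 0.40, 0.35],
    [0.30, 1.00, 0.50, 0.40],
    [0.40, 0.50, 1.00, 0.60],
    [0.35, 0.40, 0.60, 1.00],
]

to_need = path_coefficients(R, target=2, predictors=[0, 1])
to_achievement = path_coefficients(R, target=3, predictors=[0, 1, 2])

# Direct effect of policy on achievement, and its indirect effect
# routed through need achievement (product of the two paths).
direct = to_achievement[0]
indirect = to_need[0] * to_achievement[2]

print("paths into need achievement:", to_need)
print("paths into achievement     :", to_achievement)
print("policy -> achievement, direct  :", round(direct, 3))
print("policy -> need -> achievement  :", round(indirect, 3))
```

Comparing the direct coefficient with the indirect product is what shows whether an effect is largely mediated, which in turn suggests which paths are candidates for trimming.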

The question for which there is no set answer is: When are the paths weak enough
that they can be safely trimmed? Some researchers argue that a path can safely
be eliminated if its path coefficient is below .05. It is often argued, though,
that the decision should be made on meaningful substantive grounds as well. The
decision level should be justified by the individual researcher (and included in
the research report). The decision is often made on how close a fit is obtained,
that is, whether it is possible to reproduce the correlation matrix R in a
reasonable fashion once paths are deleted. Again, this information should be
included in the report.
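
A minimal sketch of the trimming decision, for the simplest possible case of three variables ordered 1 -> 2 -> 3 with a possible direct path 1 -> 3. The correlations are hypothetical, the function name is invented, and the .05 cutoff is the one mentioned above; as the text stresses, a real decision would also rest on substantive grounds.

```python
def trim_and_check(r12, r23, r13, cutoff=0.05):
    """Decide whether the direct path 1 -> 3 can be trimmed and check the fit.

    The direct path is the standardized weight of variable 1 when variable 3
    is regressed on variables 1 and 2. After trimming, the model 1 -> 2 -> 3
    implies that r13 should equal the product of the two remaining paths.
    """
    direct_13 = (r13 - r12 * r23) / (1 - r12 ** 2)
    reproduced_r13 = r12 * r23
    return {
        "direct_13": direct_13,
        "trim": abs(direct_13) < cutoff,
        "reproduced_r13": reproduced_r13,
        "discrepancy": abs(r13 - reproduced_r13),
    }


# Hypothetical observed correlations.
result = trim_and_check(r12=0.50, r23=0.60, r13=0.32)
for key, value in result.items():
    print(key, "=", round(value, 3) if isinstance(value, float) else value)
```

Here the direct path is small, and the trimmed model reproduces the observed correlation between variables 1 and 3 closely; both facts would be reported alongside the decision.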

For more information on path analysis, please consult Kerlinger and Pedhazur 
(1973) and Duncan (1966). 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 17.3 

1. Bernstein (1985) noted that unidirectional claims have often been made about
skills development in language teaching or learning. For example, audiolingual
methodologists claimed that the skill areas were separate and that instruction
should be sequenced from listening -> speaking -> reading -> writing. Noting
that Krashen (1984) also advocates a reading -> writing sequence and that
comprehension-based instruction argues for listening -> speaking, it seemed
appropriate to test out these claims. Assuming that each skill area is, indeed, a
separate skill and that these differ in some way from global language competence,
Bernstein set up a number of path diagrams (listening model, vocabulary model,
reading model, grammar-translation model, and the audiolingual model). One
model, the listening model, is shown below. This model says that aural
comprehension precedes oral production, that development of oral skills precedes
development of reading skills, and that there is a transfer of learning from
listening competence to the other skills. (Note that Bernstein didn't test oral
production or writing skills in this model, so they don't appear.)



2. Draw a model that incorporates all four of the following premises of the
audiolingual approach. Compare your model with Bernstein's model which is
given in the answer key.

a. There should be strict sequencing of skills: listening -> speaking ->
reading -> writing.

b. Primary emphasis is on oral repetition rather than aural comprehension.

c. Language is learned through repetition and manipulation of structural
patterns.

d. Vocabulary is limited, and relatively little emphasis is given to reading
and writing.

3. Compare your diagram with those of other members in your study group.
How do they differ? Which do you think best represents the audiolingual
approach? Why?


4. In Bernstein's analysis, correlations generated by the data from 446 Ss served
as input into the path analysis procedure. Each of the models was tested. Here
are the path coefficients for the listening model.



For the model to be successful, we would like to be able to trim all paths except
those shown in the first diagram. Can we do so? Why (not)?_

In some of the models, path coefficients do drop into the .16 range; however, the
results did not support any of these "theories" about skill sequence. In discussing
the findings, Bernstein notes that "language proficiency" is a composite of these
skills. ("Global," however, was not measured as a composite but by cloze
procedure.) Multicollinearity, therefore, is a problem since there is a strong
relation among the skill areas and between skill areas and total language
proficiency. Bernstein also reports that measurement experts, having performed
principal component analysis and factor analysis, have not discovered separate
dimensions for skill areas. Given that information, he suggests neither individual
skill data nor general language competence data reflect separate components.
Therefore, he argues, path analysis should not be used with such data.

Do you agree with his final conclusion regarding the use of path analysis for such 
data? Why (not)?_ 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Loglinear and Categorical Modeling 

The procedures discussed so far in this chapter are based primarily on correlation
or regression of continuous (interval) data that are normally distributed. When
our data are nominal categories, we know we cannot use these procedures.

We have already shown that Chi-square (χ²) is a good procedure when working
with frequency data, where we count, for example, the number of monolinguals
and bilinguals who are or are not required to take remedial English classes in
college. Student type is a nominal category with two levels, and remedial class
could also be a nominal variable with two levels (no remedial classes vs. one or
more remedial classes). We might also believe that sex of student affects
assignment to such classes and so we add this as a second dichotomous nominal
variable. Perhaps we believe that socioeconomic status (SES) might mediate the
effect of these other independent variables, and so we might add a three-level SES
variable to the study. The problem with Chi-square is that we cannot add more
independent variables to the table and interpret the results. Nor can Chi-square
analysis tell us about the interaction of variables. It cannot tell us which of these
independent variables best predicts the actual distribution. Nor can Chi-square
be used to test causal claims.

Categorical modeling procedures can do all these things for nominal data. The
basic loglinear procedure is analogous to regression for continuous data. It allows
us to consider how many of the independent variables and interactions affect the
dependent variable.

You will remember that the Chi-square procedure is concerned with proportions.
If we are concerned with the distribution of students who do or do not have to
take remedial English classes and who are or are not bilingual, the expected cell
frequencies are calculated using proportions (column total × row total ÷ N).
Loglinear, on the other hand, looks at how a nominal variable such as ±bilingual
affects the odds that a student will or will not need to take a remedial course. In
loglinear, these odds are found by dividing the frequency of being in one category
by the frequency of not being in the category. So, in our example, the odds of
interest are the odds that an individual selected at random would not, rather than
would, be required to take remedial courses. If we had access to national survey
data and found that 1006 Ss had no requirement for remedial work and 419 did,
the odds would be 1006/419 or 2.40. The odds are approximately two and a half to
one against assignment to remedial classes. These odds are called the marginal
odds.
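
The marginal odds can be checked with a couple of lines of Python. The function names are illustrative; the counts are the ones given in the text.

```python
def odds(count_in, count_out):
    """Odds of being in a category: frequency in / frequency not in."""
    return count_in / count_out


def odds_to_probability(o):
    """Convert odds to the corresponding proportion."""
    return o / (1 + o)


marginal_odds = odds(1006, 419)  # no remedial requirement vs. remedial requirement
print(f"marginal odds: {marginal_odds:.2f}")
print(f"proportion with no requirement: {odds_to_probability(marginal_odds):.3f}")
```

The second line of output is simply 1006 out of the 1425 Ss expressed as a proportion, which shows how odds and proportions describe the same margin in two different ways.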

We do, however, want to test whether the marginal odds are affected by the
independent variables. So, we calculate the conditional odds to discover whether
the chances of being required to take remedial classes change if the S is
monolingual or bilingual.

Let's assume the following fictitious distribution:

Assignment to Remedial Classes and Bilingualism

                        Monolingual   Bilingual   Total
No remedial classes         694          312       1006
Remedial classes            221          198        419
Total                       915          510       1425

The table shows that the odds of being excused from remedial classes are 1.58
(312 ÷ 198) among bilinguals and 3.14 (694 ÷ 221) among monolinguals. An
odds ratio is then found by dividing these (3.14 ÷ 1.58, or about 2). The odds of
being required to take remedial classes are two times greater among bilinguals
than monolinguals. Not a very happy finding even for fictitious data!
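
The conditional odds and the odds ratio for the fictitious distribution can be reproduced as follows. The dictionary layout and function name are just one convenient way to hold the four cell counts.

```python
# Cell counts from the fictitious table.
TABLE = {
    "monolingual": {"no_remedial": 694, "remedial": 221},
    "bilingual": {"no_remedial": 312, "remedial": 198},
}


def conditional_odds(group):
    """Odds of being excused from remedial classes within one group."""
    cells = TABLE[group]
    return cells["no_remedial"] / cells["remedial"]


odds_mono = conditional_odds("monolingual")
odds_bil = conditional_odds("bilingual")
odds_ratio = odds_mono / odds_bil

print(f"monolingual odds: {odds_mono:.2f}")
print(f"bilingual odds  : {odds_bil:.2f}")
print(f"odds ratio      : {odds_ratio:.2f}")
```

An odds ratio of 1 would mean that bilingualism makes no difference to the odds; the further the ratio is from 1, the stronger the association in the table.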

So far, a regular Chi-square analysis could handle the data quite well. 2 × 2
contingency tables present no problem for Chi-square (aside from the necessary
correction factor). However, it is quite possible that the findings might really be
due to socioeconomic status (SES) of the Ss. When SES is added to the equation,
we may find that the odds are really more a reflection of SES than of
bilingualism. We could then propose a "model" that says it is the interaction
between SES and bilingualism that influences the odds of needing remedial
instruction and test it against a "model" that links bilingualism and the need for
remedial instruction. Chi-square could not help us here. Instead, we use
loglinear procedures to compare these two models and test for interactions in
each.

Model is used to stand for a statement of expectations regarding the categorical
variables and their relationships to each other. In the above data, we might
propose a model where ±bilingualism (variable A) is related to ±remedial course
requirement (variable B). Using loglinear notation, this is model {AB}. Or, we
might propose a model where SES (variable C) mediates the effect of
±bilingualism on the odds of ±remedial course requirement {ABC}.
Alternatively, we could propose a model where bilingualism and remedial courses
are separately related, that these two are jointly related to sex (variable D), a
three-way interaction, and that bilingualism, SES, and sex are also mutually
related. The model would be {AB} {ABD} {BCD}. The loglinear programs can test
the competing models. There are several iterative algorithms that can be used to
generate estimates of the expected cell frequencies for each model. These
expected cell frequencies (called maximum likelihood estimates) are calculated
using natural logarithms. (In the old days we would have had to look all of these
up in logarithm tables and then do the calculations. Groan. But it's no problem
for the computer.)
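
To give a feel for what "testing a model" involves, the sketch below fits the simplest competing model for the 2 × 2 remedial/bilingual table: the one in which the two variables are unrelated. For that model the expected frequencies have a closed form (row total × column total ÷ N), so no iteration is needed. The fit statistic used here, the likelihood-ratio statistic G², is an assumption on our part; it is a standard choice in loglinear work, but the text does not name the statistic its programs report.

```python
import math

# Observed counts: rows = no remedial / remedial, columns = monolingual / bilingual.
OBSERVED = [
    [694, 312],
    [221, 198],
]


def expected_independence(table):
    """Expected cell frequencies if the row and column variables are unrelated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def g_squared(observed, expected):
    """Likelihood-ratio statistic: 2 * sum(observed * ln(observed / expected))."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            if o > 0:
                total += o * math.log(o / e)
    return 2 * total


expected = expected_independence(OBSERVED)
print("expected under independence:")
for row in expected:
    print([round(value, 1) for value in row])
print("G squared, independence model:", round(g_squared(OBSERVED, expected), 2))
# The saturated model reproduces the observed counts exactly, so its G squared is 0.
print("G squared, saturated model   :", g_squared(OBSERVED, OBSERVED))
```

A large G² means the simpler model fits the observed frequencies poorly, so the association term {AB} would have to stay in the model.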

Once the program has produced the expected frequencies for the model, these are
entered into the program to produce what are called effect estimates (symbolized
by tau τ, or lambda λ, or even χ²) for the variables and their interactions.
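
As an illustration of what lambda effect estimates are, the sketch below computes them for a saturated model of a 2 × 2 table, where the expected frequencies equal the observed ones. It assumes the common parameterization in which the effects for each variable sum to zero; other programs use different constraints and will print different numbers. The counts are those of the fictitious remedial/bilingual table, and the function name is hypothetical.

```python
import math


def saturated_lambdas(table):
    """Lambda effects for a saturated two-way loglinear model.

    With sum-to-zero constraints, ln(frequency) = grand mean + row effect
    + column effect + interaction effect.
    """
    logs = [[math.log(f) for f in row] for row in table]
    n_rows, n_cols = len(logs), len(logs[0])
    grand = sum(sum(row) for row in logs) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in logs]
    col_means = [sum(col) / n_rows for col in zip(*logs)]
    row_effects = [m - grand for m in row_means]
    col_effects = [m - grand for m in col_means]
    interaction = [
        [logs[i][j] - row_means[i] - col_means[j] + grand for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return grand, row_effects, col_effects, interaction


# Rows = no remedial / remedial, columns = monolingual / bilingual.
counts = [[694, 312], [221, 198]]
grand, row_effects, col_effects, interaction = saturated_lambdas(counts)

print("remedial main effects   :", [round(v, 3) for v in row_effects])
print("bilingual main effects  :", [round(v, 3) for v in col_effects])
print("interaction effects     :", [[round(v, 3) for v in row] for row in interaction])
```

In a 2 × 2 table the interaction lambda is one quarter of the natural log of the odds ratio, which ties these estimates back to the odds discussed earlier.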

Perhaps an example will help to make this complex process less opaque. De
Haan (1987) used a very large text data base to replicate the finding that
complexity of noun phrases depends on the function of the noun phrase. Previous
research had been partly successful in showing that simple noun phrases tend to
appear in subject function while structurally more complex noun phrases do not.
Since Chi-square analysis cannot look at interactions, the research had been
unable to determine whether this finding holds across different text types (i.e.,
whether there is an interaction between NP function and text type).

Using loglinear analysis, de Haan examined 25,210 NPs in either fiction or
nonfiction text (using the Nijmegen corpus). The NPs were classified as BASIC head
nouns (such as he, Harry, water or books), SIMPLE determiner + head nouns
(e.g., this article or my car), EXTENDED head nouns where the head is preceded
by anything but a single determiner (e.g., my old car or expensive books), or
COMPLEX with at least one postmodifier irrespective of the structure of the rest
of the NP (e.g., books for sale or that very tall man who just walked in). This gives
us four levels of NP complexity: basic, simple, extended, and complex. Three
functions were selected: subject, object, prepositional complement. And two text
types were chosen from the corpus: fiction and nonfiction. The design, therefore,
is 4 × 3 × 2, a design that Chi-square cannot handle.

The particular loglinear procedure used by de Haan produced lambda effects for
each main variable, for all the two-way interactions, and all the three-way
interactions. This is the so-called saturated model, a model which includes all the
variables and interactions. The table gives the reader what are sometimes called
parameter effects. It lists the effect for each main effect variable, then for each
two-way interaction (and there are many of these because of the number of levels
within each of the variables), and finally for all the possible three-way
interactions.

A small portion of the lambda table for the de Haan data is shown here:

effect       lambda      effect            lambda

BasicN        0.792*     BasicXNFict       -0.232*
Simple       -0.091*     SimpleXNFict      -0.037
Extend       -0.499*     ExtendXNFict      -0.007
Complex      -0.203*     ComplexXNFict      0.275*


In a sense, the parameter table (lambda table in this example) is something like
a combined ANOVA and multiple-range test report. It shows which of all the
variables and interactions are significant in the study. De Haan chose to
interpret the significant parameters in this table. Using the table in a way
analogous to ANOVA would mean beginning with interpretation of the three-way
interactions (if they are interpretable), then the two-way interactions, and
finally interpreting the main effects in light of the interactions.

Using the saturated model (where all the variables and interactions are included),
there are many interpretations to be made. There is a problem, however, when
statistics are calculated for such a large number of lambdas. With multiple
comparisons, the possibility of finding some estimates significant when there
really is no effect increases with the number of estimates. In the article, of
course, the author interpreted those which are most central to the question of
interest (i.e., does the number of NPs vary by function in the different text
types). As with ANOVA, the best way to illustrate interactions is with a figure.
The following figures show the interaction between text type, function, and NP
complexity.

[Figures: two panels, each with the NP complexity levels BASIC, SIMPLE,
EXTENDED, and COMPLEX along the horizontal axis.]


j From this picture, it appears that subject NPs are usually basic nouns, but this 
: -is more true for fiction than for nonfiction. Nonfiction does not so clearly obey 
fthe "basic N in subject function" prediction. 
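The caution about multiple comparisons can be made concrete with a short
calculation. The sketch below is ours, not de Haan's: it counts the non-redundant
lambda parameters of a saturated 4 × 3 × 2 model and then, assuming the tests
are independent and each is run at an alpha of .05 (both are assumptions for
illustration only), estimates the chance of at least one spuriously significant
estimate. The names `independent_parameters` and `familywise_error` are
hypothetical helpers.

```python
from itertools import combinations
from math import prod

# Levels in the design: NP complexity (4), function (3), text type (2).
levels = {"complexity": 4, "function": 3, "text_type": 2}

def independent_parameters(levels):
    """Count non-redundant lambda parameters for every effect in a saturated model."""
    counts = {}
    names = list(levels)
    for order in range(1, len(names) + 1):
        for effect in combinations(names, order):
            # Each variable in an effect contributes (levels - 1) free contrasts.
            counts[" x ".join(effect)] = prod(levels[v] - 1 for v in effect)
    return counts

def familywise_error(alpha, k):
    """Chance of at least one false positive in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

params = independent_parameters(levels)
total = sum(params.values())  # excludes the constant (grand mean) term
print(params)
print(total, round(familywise_error(0.05, total), 3))
```

Under these assumptions the chance of at least one false alarm is well over
one half, which is why it is sensible to interpret only the parameters that bear
directly on the research question.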

We have said that the loglinear process tests models. Actually, researchers
usually posit several models and evaluate them to see which best fits the data.
In the example above, the researcher did not test alternative models but
interpreted the saturated model where all the main effects and interactions are
left in. In essence, this uses loglinear as an alternative to ANOVA for nominal,
categorical variables. It doesn't, then, show the major strength of loglinear
modeling, which is to test and compare models.

When models are tested against each other, the use of loglinear is analogous to
multiple regression. When models are tested against each other, the one with the
fewest variables and/or fewest interactions that successfully predicts the
observed odds is the winner. Think of this in terms of multiple regression.
There, the "winning" variable is that which explains most of the variance. We can
then look to see how much more can be explained if we add the next variable, and
so forth. Here, loglinear tests each model and then allows us to test the
difference between them to see which can best cover the odds that were derived
from the frequencies.

In testing the models (note, now, that we have moved away from the parameters
and are testing models), some computer packages with loglinear programs
produce a χ² statistic while others print out a likelihood ratio (abbreviated L²).
The larger the L² or χ² (relative to the df), the more the model differs from the
observed data. Since we want to find a model which does fit the data, we look
for the lowest L² or χ² value relative to the df. This is contrary to the usual
interpretation of χ², and is similar to χ² fit statistics found in factor analysis
programs.
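To see how the two fit statistics relate, here is a minimal sketch that computes
both for a small two-way table. The counts are hypothetical and the model being
fitted is simple independence of rows and columns; the function names are ours.
Both statistics are zero when the expected frequencies equal the observed ones
(as in a saturated model) and grow as the model departs from the data.

```python
from math import log

def independence_expected(table):
    """Expected counts under independence for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi2(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

def likelihood_ratio(observed, expected):
    """Likelihood ratio statistic: 2 * sum of O * ln(O / E) over all cells."""
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(o * log(o / e)
                   for o_row, e_row in zip(observed, expected)
                   for o, e in zip(o_row, e_row) if o > 0)

observed = [[30, 42], [43, 24]]  # hypothetical counts, for illustration only
expected = independence_expected(observed)
chi2 = pearson_chi2(observed, expected)
l2 = likelihood_ratio(observed, expected)
print(round(chi2, 2), round(l2, 2))
```

With reasonably large cell counts the two values are usually close, so the
choice between them rarely changes the decision about a model.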

It is possible, since we want to find the one model that includes all the variables 
and interactions needed to come close to the original data, that we will select a 
model that is "too good." That is, we may potentially select a model that includes 
relationships that really only reflect normal sampling error. 

Consider our previous example for ±bilingual, ±remedial class, and
SES (variables A, B, C). We want to find the best model for the data. If we start
with the saturated model, the one that includes everything, we will have a perfect
fit. Our first alternative model eliminates the three-way interaction. The model
is {AB BC}, a model with two interactions (A X B and B X C) and three main
variables (A, B, C). Say this model has a close fit to the data. The χ² (2 df) is
.01, p = .9931. It is very close to the full saturated model (which would include
a three-way interaction and where p would equal 1.00 and the df = 0). A second
alternative model {BC A} could then be proposed. This has one interaction (B
X C); the main effects are still there but A no longer interacts with B or C. (That
is, SES and ±remedial classes interact but ±bilingual does not interact with
±remedial classes.) Say this model also fits the data: χ² (3 df) is 4.43, p = .2187.
The saturated model and the first alternative model gave us a super fit to the
data. The fit for the first alternative may still be "too good." Many researchers
prefer a p between .10 and .30 to assure a good fit and a parsimonious solution.
With a p of .22, the second alternative model looks like the better model: a close
fit but not so close that it may include "error" relationships.

If this model fits the data and is more parsimonious than the others, common
sense would say to accept the model. But there is another check that can be
made. If we subtract the first alternative model from the second, we can check
to see if they differ significantly in how well they predict the data. Unfortunately,
they do: χ² (3 df) minus χ² (2 df) = 4.42 (1 df), p = .0356. This indicates that the
two models are different and that the A X B interaction is significant. So, although
the last model is more parsimonious, the researcher is left having to decide on
substantive grounds whether the interaction can or cannot be dropped.
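The subtraction check can be reproduced by hand. The sketch below is our own
illustration, not the output of any package: `chi2_sf` is a hypothetical helper
that returns the upper-tail probability of χ² for whole-number df using standard
closed-form expressions. We apply it to the second alternative model (4.43 with
3 df) and to the difference between the two alternative models (4.43 - .01 = 4.42
with 3 - 2 = 1 df).

```python
from math import erfc, exp, pi, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of chi-square for a positive whole-number df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return exp(-half) * total
    # Odd df: normal-tail term plus a finite series in odd powers of sqrt(x).
    term, total = sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return erfc(sqrt(half)) + sqrt(2.0 / pi) * exp(-half) * total

second_model = chi2_sf(4.43, 3)             # fit of {BC A}
difference = chi2_sf(4.43 - 0.01, 3 - 2)    # {BC A} compared with {AB BC}
print(round(second_model, 4), round(difference, 4))
```

Both results fall within rounding distance of the probabilities quoted above. The
difference test always has df equal to the difference between the two models'
df, here 1, because only the A X B interaction separates them.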

The procedure may seem confusing, so we will give another illustration.
Unfortunately, we have found no applied linguistics studies that have used
loglinear to test models, so we must give you another fictitious example. The
values that are used to illustrate the procedure are based on a study by Ries and
Smith (1963); the actual loglinear analysis is from the SAS User's Guide:
Statistics (1985, 225-228).

Imagine that you wanted to know students' preference for two different language
teaching methods. Method A is project-based (where students work both
individually and in groups to carry out a project such as researching and drawing
up a plan for improving student counseling) and method B is teacher-centered
instruction (where students work primarily on an individual basis to meet course
objectives). You believe that these preferences may vary depending on age level
(three age groups 18-20, 20-30, 30 and over). You also think that preference may
be influenced by whether Ss have had previous experience with the method.
Finally, you decide to include sex as a third variable that might relate to
preference.

The design includes one dichotomous dependent variable: method choice, A or
B. There are three independent variables: Age has 3 levels, Previous Experience
(with the preferred method) has 2 (yes = 1, no = 2), and Sex has 2 (female = 1,
male = 2). Assume that through the magic of the research committee and the
Adult Education Special Interest Group of TESOL, you were able to obtain the
following data.


Population and Response Profiles

Sample  AgeGp  PrExp  Sex     N   MethA  MethB  ProbA  ProbB
   1      3      2     2     72     30     42    .417   .583
   2      3      2     1    110     42     68    .382   .618
   3      3      1     2     67     43     24    .642   .358
   4      3      1     1     89     52     37    .584   .416
   5      2      2     2     56     23     33    .411   .589
   6      2      2     1    116     50     66    .431   .569
   7      2      1     2     70     47     23    .671   .329
   8      2      1     1    102     55     47    .539   .461
   9      1      2     2     56     27     29    .482   .518
  10      1      2     1    116     53     63    .457   .543
  11      1      1     2     48     29     19    .604   .396
  12      1      1     1    106     49     57    .462   .538


Look at the top line of the table. There were 72 students in age group 3 who had
no previous experience with the method they chose, and who were male. Of these
72 students, 30 chose method A and 42 chose method B. The probability of
selecting method A for this group of students was .417, and .583 for method B.
The second line shows the choices of the 110 students who belonged to age group
3, who had no previous experience with the method they selected, and who were
female. Since there were three age groups, two experience levels, and two levels
for sex, there were 3 × 2 × 2 = 12 samples.
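The probabilities in the profile table are simply each count divided by the
sample size. As a quick check, the sketch below recomputes them for three rows
copied from the table and enumerates the design cells; the helper name
`response_profile` is ours.

```python
from itertools import product

# Rows copied from the table: (age group, prev. exp., sex) -> (MethA, MethB)
counts = {
    (3, 2, 2): (30, 42),
    (3, 2, 1): (42, 68),
    (3, 1, 2): (43, 24),
}

def response_profile(meth_a, meth_b):
    """Return sample size and the proportion choosing each method."""
    n = meth_a + meth_b
    return n, round(meth_a / n, 3), round(meth_b / n, 3)

profiles = {cell: response_profile(*ab) for cell, ab in counts.items()}

# Every combination of 3 age groups, 2 experience levels, and 2 sexes is one sample.
cells = list(product((1, 2, 3), (1, 2), (1, 2)))

for cell, profile in profiles.items():
    print(cell, profile)
print(len(cells), "samples in the full design")
```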

The saturated model, the one which includes all the variables and all the
interactions, gives us a χ² value for each variable and interaction (called
parameters). The χ² statistics here are not "goodness of fit" statistics. They are
read as ordinary χ² statistics. Those with high χ² values and low probabilities
are the parameters that should be retained in formulating alternative models.
(You will notice there is an extra parameter for age because this is a three-level
[trichotomy] category rather than a dichotomy.)

Analysis of Individual Parameters

Effect           Parameter  Estimate  Std.Err.      χ²      p
Intercept            1        .031      .067       0.21   .651
Age                  2       -.004      .094        .00   .960
                     3        .028      .095        .09   .769
Prev Exp             4       -.314      .067      21.80   .001
AgeXPrev Exp         5       -.121      .094       1.67   .196
                     6       -.064      .095       1.67   .196
Sex                  7        .128      .067       3.63   .057
AgeXSex              8       -.031      .094        .11   .741
                     9       -.009      .095        .01   .919
Prev ExpXSex        10       -.101      .067       2.25   .134
AgeXPrExpXSex       11        .077      .093        .66   .415
                    12       -.059      .094        .39   .531


This is the saturated model with all variables left in. The SAS printout has given
us χ² values to interpret, and a quick glance at them shows that previous
experience is significant, sex approaches significance in predicting preference,
and previous experience and sex interact to some degree in shaping preferences
though the interaction is not statistically significant.
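It may help to see where a χ² for a single parameter can come from. The sketch
below assumes (this is our assumption; the printout itself does not say) that each
parameter's χ² is the squared ratio of the estimate to its standard error, tested
with 1 df. Because the printed estimates and standard errors are rounded to
three decimals, the results only approximate the printed values.

```python
from math import erfc, sqrt

def wald_chi2(estimate, std_err):
    """Chi-square (1 df) for one parameter: (estimate / standard error) squared."""
    z = estimate / std_err
    chi2 = z * z
    p = erfc(sqrt(chi2 / 2.0))  # upper-tail probability of chi-square with 1 df
    return chi2, p

# Estimates and standard errors as printed in the saturated-model table.
rows = {
    "Prev Exp": (-0.314, 0.067),
    "Sex": (0.128, 0.067),
    "Prev Exp x Sex": (-0.101, 0.067),
}
for name, (est, se) in rows.items():
    chi2, p = wald_chi2(est, se)
    print(f"{name:15s} chi2 = {chi2:6.2f}  p = {p:.3f}")
```

Read this way, a parameter is worth keeping when its estimate is large relative
to its standard error, which is the same logic as an ordinary significance test.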

The residuals left for the saturated model are nil, since everything is included; the
model matches the observed data perfectly. Let's see what happens when we try
to get rid of the interactions. Can the main effect variables give us a good
estimate of the actual odds in the original data?

Here are the results for the main effects model: 


Analysis of Individual Parameters

Effect       Parameter  Estimate  Std.Err.      χ²      p
Intercept        1        .020      .067        .20   .652
Age              2       -.009      .091        .01   .913
                 3        .039      .090        .19   .664
Prev Exp         4       -.28       .064      19.28   .001
Sex              5        .128      .067       3.63   .057


From the table it appears that only previous experience with method is
significant. Let's look, though, at how well this model fits the original data. If
you compare the χ² values in the two tables, you will see that the shifts are small.
The computer also gives a value for a "residual." This is a measure of how well
the data fit the model. The residual "goodness of fit" statistic for this model is
8.18 (7 df), and the probability is .317. (In a perfect fit, the probability would be
1.00.) We have already noted that researchers usually look for probabilities
somewhere between .10 and .30 to assure that all important variables and
interactions are left in but that nothing extraneous is included when it shouldn't
be. We don't want too good a fit. A probability of .317 seems to be fairly
acceptable. If you compare the two tables, you can see that this difference in
probability is perhaps connected to the interaction between previous experience
with the method selected and sex. We might, then, propose and test another
model, adding this particular interaction to the main effects. We could compare
it with the main effects model and make a final decision on the best fitting model
for the data. The final decision would be tempered by substantive argument as
well as by the p levels.


Categorical Models and Causal Analysis 

Loglinear has several variations that fit different uses. For example, it is possible
to do "path analysis" on nominal data using loglinear techniques. The resulting
figures are very similar to those used in path analysis. For example, assume we
thought we could predict attitudes towards bilingual education by some
combination of respondent's age, education, and the geographical area of
residence. It is impossible to run a regular path analysis because not all of the
variables are really continuous in nature. Loglinear, as with regular path analysis,
will allow us to determine whether the model can be "trimmed" so that all of the
paths do not need to be maintained. The following diagram is based on fictitious
data.



Depending on our cutoff point for trimming paths, we might trim the one
between region and education and the one between age and attitudes. The result
would show that while the effect of age on attitudes was mediated by education,
the effect for region was not.


Categorical Modeling with Repeated-Measures 

In loglinear procedures, models are tested to see which best fit the observed data.
One of the loglinear procedures which is called "loglinear" makes no distinction
between independent and dependent variables (although, as in our examples, they
can be interpreted in this way). Another related loglinear procedure called
CATMOD (for categorical modeling) does distinguish dependent and independent
variables. These two procedures are basically the same. SAS calls them both
categorical modeling; other books call them both loglinear. The time when it is
really important to make a distinction is when the data appear in a
repeated-measures design.

Both loglinear and CATMOD programs can be used for time series studies. They
both can be used for between-Ss designs (where, say, questionnaires are collected
from different Ss over time). For example, either method could be used to look
at the attitudes towards bilingual education at five-year time periods. If region
was an important variable, the country might be divided into, say, five major
areas (Northeast, Midwest, South, Northwest, Southwest). Attitudes might be
placed in five levels (highly positive, positive, neutral, negative, highly negative).
The frequencies at the two time levels could then be subjected to either loglinear
or CATMOD to find the best fit for the data.

CATMOD is, however, special because it can be used to analyze time series
studies for nominal variables where the data are from the same Ss. You will
remember that one of the assumptions of the Chi-square analysis is that the data
are independent. Chi-square cannot be used for repeated-measures designs.
CATMOD is a procedure which can handle data such as those given in the
following examples. This is a fairly complex design. Since, again, we were unable
to find examples of repeated-measures categorical modeling in applied linguistics
literature, we have adapted an example from Koch et al. (1977) using the data
which appear in the SAS manual (SAS User's Guide: Statistics, 1985, 234-236).

Imagine that you worked in a private language school which dealt with students
who had intercultural communication problems. Some portion of the students
have suprasegmentals that are judged as either highly annoying or moderately
annoying because they do not match those of native speakers. The school offers
a special program which these students may elect to take. Alternatively, they can
take a regular communication class which treats many different areas of
communication. The school wants to know whether there is any difference in
outcomes for these two groups of Ss (moderately annoying vs. highly annoying)
given these two treatments. Data were gathered from both groups. At three times
(weeks 1, 2, and 4), the students were videotaped as they carried out a
conversation with a native speaker. Three raters reviewed a one-minute segment
from each conversation, judging the learner's speech as either "acceptable" or
"irritating." In cases of disagreement, a second one-minute segment was reviewed.
In all cases, judges were able to agree following two ratings. The population
profiles are as follows:


Population Profiles

Sample   Group      Class        N
   1     moderate   commun      80
   2     moderate   supraseg    70
   3     high       commun     100
   4     high       supraseg    90


There are three repeated ratings. It is possible to obtain one of 8 different
response patterns. If these are abbreviated A = acceptable and I = irritating, the
response patterns for the three weeks could be III, IIA, IAI, IAA, AII, AIA, AAI,
AAA. These are 8 different possible response patterns which are numbered 1
through 8 respectively.

The individual ratings over the three periods are entered into the computer along
with information on each person's original referral status (moderately vs. highly
annoying) and the class selected (communication class vs. suprasegmental class).
SAS then gives this information back in the form of response frequencies.


Pattern Response Frequencies

Samp.     1    2    3    4    5    6    7    8
  1      16   13    9    3   14    4   15    6
  2      31    0    6    0   22    2    9    0
  3       2    2    8    9    9   15   27   28
  4       7    2    5    2   31    5   32    6


To read this table, first identify the sample group. Sample group 1 includes those
people who are prediagnosed as having suprasegmentals that are moderately
irritating and who took the regular communication class. There are 80 such
people. These 80 people each received a rating at three separate times. Those who
did not improve (i.e., were rated as I, irritating, at each of the three time periods)
are in response group 1. Sixteen of the 80 Ss are in that response group. Thirteen
received IIA ratings, 9 received IAI ratings, and so forth.
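The bookkeeping behind the response table is easy to mimic. The sketch below
(our own illustration, not SAS output) generates the eight response patterns in
the order they are numbered, attaches the frequencies from the table, and
confirms that each sample's frequencies add up to its N.

```python
from itertools import product

# "I" = irritating, "A" = acceptable; one letter per rating (weeks 1, 2, and 4).
patterns = ["".join(p) for p in product("IA", repeat=3)]

# Frequencies per sample, in pattern order, copied from the response table.
frequencies = {
    1: [16, 13, 9, 3, 14, 4, 15, 6],
    2: [31, 0, 6, 0, 22, 2, 9, 0],
    3: [2, 2, 8, 9, 9, 15, 27, 28],
    4: [7, 2, 5, 2, 31, 5, 32, 6],
}
sample_sizes = {1: 80, 2: 70, 3: 100, 4: 90}

by_pattern = {s: dict(zip(patterns, f)) for s, f in frequencies.items()}
totals = {s: sum(f) for s, f in frequencies.items()}

print(patterns)
print(by_pattern[1]["III"], by_pattern[1]["IIA"], by_pattern[1]["IAI"])
print(totals == sample_sizes)  # every S falls into exactly one pattern
```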

CATMOD works out the odds ratios and runs the analysis for the saturated
model. Then, on request, it reruns the analysis for the main effects (original
diagnosis and class selected without the interactions).

The parameter output (which is read as a regular χ², not as goodness of fit) is
shown below. It shows which parameters to keep if we want to try a subsequent
model.


Analysis of Individual Parameters

Param         Estimate  Std.Err.       χ²       p
mod-comm        -.072     .135         .28    .5960
mod-supra      -1.353     .135      100.48    .0001
high-comm        .494     .096       26.35    .0001
high-supra      1.455     .130      125.09    .0001


This table shows precisely where significant results (improvement) were obtained.
Students who entered the program with suprasegmentals that were highly
annoying improved in either treatment. That is, they improved if they were in the
communication class (χ² = 26.35, p < .0001) and they improved if they were in
the suprasegmental class (χ² = 125.09, p < .0001). Those Ss who entered the
program with moderately annoying suprasegmentals improved significantly in the
suprasegmental class (χ² = 100.48, p < .0001), but not in the regular
communication class.

This should suggest that while CATMOD gives χ² values and while it uses
statistical procedures akin to loglinear, the final printout and interpretation is not
dissimilar to analysis of variance followed by a multiple-range test. Of course,
data such as those described above could not be analyzed via analysis of variance
since the data do not meet the requirements of such tests.

The question still remains as to whether this "main effects" model fits the original
data. To find out, we check the residual "goodness of fit" statistic in the printout.
Yes, it does. The residual "goodness of fit" statistic shows a low χ² value and a
high probability (χ², 8 df = 4.20, p = .8387). That's a very good fit to the
original observed data.

The goodness of fit statistic argues for acceptance of this model as a more
parsimonious solution. However, the fit may still be "too good" given the .8387
probability for fit. We would want to see what changes we could make to have
the model even more concise. However, looking back at the parameter table, we
see that the only parameter left to eliminate would be the first parameter
(moderate-communication class). It makes no sense to delete this from the model.

The example we have just presented is very complex since it contains so many 
different response patterns. However, it is not atypical of the types of designs 
and tasks used in applied linguistics. We wanted to show you the possibilities for 
analysis using categorical modeling for complex designs. Let's turn to something 
simpler as a practice exercise. 

ooooooooooooooooooooooooooooooooooooo 

Practice 17.4 

1. Immigrant and foreign students either do or do not take advantage of a
university counseling program. Their status (±probation) is recorded at three time
periods. The research asks whether counseling has any effect on probationary
status.

Dependent variable?_ 

Independent variables?__ 

Explain why this is both a between-subjects and a repeated-measures design. _ 


Assume you had 80 Ss in the +counseled immigrant group and 100 in the
-counseled immigrant group. There are 180 +counseled foreign students and 70
-counseled. The "response" pattern for status over the three time periods could
be + + +, + + -, + - +, - + +, - + -, etc. Pretend you are a computer, and
give a response profile for your own fictitious data set.

Sample  Group           1    2    3    4    5
  1     +Cons Immig
  2     +Cons FS
  3     -Cons Immig
  4     -Cons FS

Assume that the saturated model showed that the +counseled group does
improve probationary status over time. It appears this is somewhat more true of
immigrant than foreign students. The printout for the reduced model with only
main effects (no interaction) shows a residual χ² of 5.09 and a probability of .532.
Would you accept the reduced model, try another model, or go with the saturated
model? Why?_


ooooooooooooooooooooooooooooooooooooo 

The results of the procedures used in this chapter are not always easy to interpret.
Each procedure performs a variety of statistical manipulations which are so
complex that they cannot readily be carried out by hand. Because they are so
complex, it is often difficult to get a handle on the logic behind the procedures.
In addition, the researcher is asked, in the end, to make crucial decisions
regarding, for example, the number of factors to accept and the naming of the
factors in principal component analysis and factor analysis, and multidimensional
scaling. In categorical modeling, the researcher must decide which model best
predicts the observed distribution of frequencies in a way that doesn't over- or
underpredict the number of variables or interactions to include. Path analysis
and the use of causal modeling in loglinear procedures also require that the
researcher make reasoned decisions, first, in theory formation and, second, in
deciding on cutoff points that will allow paths to be "trimmed" in theory testing.
These decisions should be influenced not just by statistical arguments but by the
substantive issues inherent in the research question. All of this assumes a
sophistication beyond that which one can normally expect to attain without
sustained practice in use of the techniques. It is, therefore, especially important
that you read as many source manuals as possible on each procedure and work
with a seasoned statistical consultant when applying these procedures.

This is not meant to discourage you from use of the procedures. As our field 
moves into more sophisticated studies with complex designs and large numbers 
of observations or Ss, these procedures open up many new avenues for data
analysis, exploratory work towards theory, and theory testing particularly with 
large-sample-size studies. The use of all these procedures in applied linguistics 
research will surely grow over the next few years. 


Activities 

1. J. Oller (1983. Evidence for a general language proficiency factor: an
expectancy grammar. In Oller, J. [Ed.]. Issues in Language Testing Research. New
York, NY: Newbury House, 3-11) performed a PCA analysis on the UCLA ESL
Proficiency Exam which consisted, then, of five parts: vocabulary, grammar,
reading, dictation, and cloze. Four different data sets were analyzed in the study.
The part illustrated below is for one set consisting of 119 Ss. The first component
accounted for 76.1% of the total variance. Each test contributed most of its
variance to this first component.

Subtests       1st Component
Vocabulary         .85
Grammar            .88
Reading            .83
Dictation          .89
Cloze Test         .91


Oller (following Nunnally [1967]) notes that if the principal components analysis
exhausts the reliable variance in the contributing variables, then the simple
correlations of pairs of tests should equal the loading of one variable on the first
component multiplied by that of the second variable. So, for example, the
correlation of vocabulary and grammar should be close to .85 × .88. Checking the
correlation matrix, we find the actual correlation is .67. Estimating the
correlation for each pair of variables from their loading on the principal component
shows how well the first component, alone, predicts the original correlation
matrix. The "residuals" table below shows what would be left if the first
component were removed.


Residuals from Correlations Minus Multiples of Factor Loadings

            Vocab   Grammar   Rdg     Dict    Cloze
Vocab         --     -.07     -.06    -.07    -.07
Grammar, Rdg, Dictation, Cloze (remaining pairs): -.11  .01  -.09  -.04  -.03  .00


Since the residuals are small, it appears that the component does indeed cover
most of the variance.
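The arithmetic behind this check is simple enough to try yourself. The sketch
below is ours: it multiplies pairs of first-component loadings to get the
correlations that the component alone would predict, and it also computes the
mean squared loading, which for a component extracted from a correlation matrix
gives the share of total variance it carries. Only the vocabulary-grammar
correlation (.67) is quoted in the text, so that is the only residual computed,
and because the loadings are rounded the result may differ slightly from the
printed residual.

```python
from itertools import combinations

loadings = {
    "Vocabulary": 0.85,
    "Grammar": 0.88,
    "Reading": 0.83,
    "Dictation": 0.89,
    "Cloze": 0.91,
}

# Correlation implied by the first component alone: product of the two loadings.
predicted = {(a, b): loadings[a] * loadings[b] for a, b in combinations(loadings, 2)}

# Share of total variance carried by the component: mean squared loading.
variance_share = sum(v ** 2 for v in loadings.values()) / len(loadings)

observed_vg = 0.67  # vocabulary-grammar correlation quoted in the text
residual_vg = observed_vg - predicted[("Vocabulary", "Grammar")]

for pair, r in predicted.items():
    print(pair, round(r, 3))
print(round(residual_vg, 3), round(variance_share, 3))
```

The variance share computed this way agrees with the 76.1% reported for the
first component.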

This particular article has been the center of much controversy since it used an
unrotated principal component analysis. Because this is a critical issue, Oller
included the original paper and Farhady's criticism (1983) in this volume. Review
the statistical and substantive arguments in the original paper and the criticism
of the procedure in Farhady. Do these arguments lead you to accept the sequence
of procedures proposed by Davidson (1988) for PCA and FA: "PCA -> scree plot
of PCA evaluations -> FA extraction to a range of possible n factors (e.g. 1 to 4
or 5) and rotate each extraction both orthogonally and obliquely -> check the
substance of each n factor extraction (i.e. does it make sense, or put another way,
can the criterion of interpretability help decide on an n factor?) -> commit to one
n factor, interpret it further, and report it"? Why (not)?

2. T. Robb, S. Ross, & I. Shortreed (1986. Salience of feedback on error and its
effect on EFL writing quality. TESOL Quarterly, 20, 1, 83-95.) did a factor
analysis as a preliminary procedure to an analysis of feedback methodology on
EFL writing. The authors analyzed and graded 676 narrative compositions
(written by 134 Japanese college students at five equal time intervals). A factor
analysis was performed on each of these five sets individually. The factor
analyses essentially showed the same results. Three factors were found: those
grouped by "error-free" criteria (accuracy factor), by amount of information
presented, such as the number of words or clauses (fluency factor), and by
"additional" clauses (complexity factor).

Before reading the article, review the table on pages 94-95 of the article. Look 
at each measure that contributes to factor 1. Do you agree that "accuracy" is a 
good name for this factor? 

Repeat this process for factors 2 and 3. Now look at the communality column.
Which measure has the lowest communality? Which has the highest? On which
measure does communality seem to vary across the five time groups? Remember
that total variance of a test consists of common variance and unique variance
(specific variance + error variance). How do you interpret the differences in
communality across measures and the occasional differences within a measure
across time periods? Fortunately for the reader, the authors present only loadings
above .30 in this table (this makes it much easier to read). Would you like to
know the value of those below .30? Why (not)? Look at those tests which load
on more than one factor. Do the measures which contribute to more than one
factor do so across each test administration? If this is not the case, how would
you explain these discrepancies? Now read the article and check your
interpretation of the table against that given by the authors in the results section.

3. D. Biber (1986. Spoken and written textual dimensions in English: resolving 
the contradictory findings. Language, 2, 384-414.) used factor analysis to reduce 
some 41 linguistic features in 545 text samples consisting of approximately 2000 
words each. The analysis yielded three underlying factors which he labeled 
Interactive vs. Edited text, Abstract vs. Situated Content, and Reported vs.
Immediate Style. In this article, the author then goes on to show how specific
findings from earlier studies fit with this three-factor model. If this were your 
study, what other possible procedures might you want to use? What rationale 
can you give for each choice? 

4. I. H. Ijaz (1986. Linguistic and cognitive determinants of lexical acquisition in
a second language. Language Learning, 36, 4, 401-451.) investigated the
meanings of six spatial prepositions in two categories, ON (on, upon, onto, on top
of) and OVER (over, above). Two hundred sixty-nine Ss, both native speakers
and ESL learners, were given two tests to ascertain the meaning of these
prepositions. The semantic relatedness test had subjects draw an X, in relation to
a line, which showed the relationship in question; the other test was a
cloze-sentence completion. Results drawn from an MDS analysis showed that ESL
learners differed significantly from native speakers in the semantic boundaries
they ascribed to these spatial prepositions. The author concluded that the
meanings that words have for ESL learners are influenced by contextual
constraints, cognitive factors, and L1 transfer.

This study has a great deal of information packed into a small space. Authors
with dissertation research are often faced with this problem: should all the
findings be shared at once (as they are here) or might the dissertation be
reasonably presented in two or three articles? If this were your dissertation, which
route would you choose? Why?

5. R. Hozayin (1987. The graphic representation of language competence:
mapping EFL proficiency using a multidimensional scaling technique. Language
Testing Research: Selected Papers from the 1986 Colloquium. Defense Language
Institute, Monterey, California.) notes that one of the ways of answering the
question "What is the test actually testing?" is to ask, "How many significant
identifiable separate factors is the test testing?" The data here consisted of
responses to 36 items on a cloze, an elision test (where extraneous words are to be
crossed out), and a multiple-choice vocabulary test, tests taken by adult Egyptian
EFL students. The paper presents the argument that factor analysis is not really
appropriate when searching for unidimensionality within tests where the items
are usually ± dichotomies (i.e., right or wrong) rather than linear measures.
Therefore, an MDS analysis was performed and 2-dimension, 3-dimension,
4-dimension and 5-dimension solutions given. The displays, to the extent that
they are interpretable, show the cloze test items as neighboring clusters.
Particular cloze items cluster more closely to each other than to other cloze items
in the diagrams. By and large, the elision items are closer to each other than to the
cloze items (though the clustering among them is not so tight).

The author also raises the issue of whether stress levels might be higher (and
acceptably so) when MDS is used to analyze items within a test. If your interest is
in test development or language testing in general, review this excellent article.
If your group shares research reports, prepare a summary of the arguments
presented in the paper, and explain how Rasch statistics are integrated into the
interpretation of the results. Finally, think about the practical explanation given
for the cloze vs. elision clusters and report your conclusions.

6. R. Kirsner (1989. Does sign-oriented linguistics have a future? In Y. Tobin
[Ed.]. From Sign to Text: A Semiotic View of Communication. Amsterdam: 
John Benjamins, 161-178.) posited a "retrieval distance" or "referential distance" 
hypothesis to explain the position of Dutch deictic terms in the sentence. The 
path coefficients are given in the diagram below. 



The coefficient indicates, in each case, the relative effect of the second variable
on the first. Thus, the diagonal path from Demonstrative Type to Demonstrative
Position with a coefficient of .023 shows that the direct influence of
Demonstrative Type on Demonstrative Position is negligible. Since .023 is below
the .05 cutoff point, the path can be deleted. In contrast, the influence of
Retrieval Distance on Demonstrative Position is very high, p = .815. And the
influence of Demonstrative Type on Retrieval Distance is also meaningful
(p = .466). Kirsner, therefore, concluded that there is no direct effect of
Demonstrative Type on Demonstrative Position. The effect is mediated through
the intermediate variable of Retrieval Distance. This allowed him to refute an
opposing hypothesis (the "attraction hypothesis") and make a causal claim that
Retrieval Distance determines Demonstrative Position.

If this were your study, which of the two types of path analysis would you select? 
Why? 

7. W. Tunmer, M. Herriman, & A. Nesdale (1988. Metalinguistic abilities and
beginning reading. Reading Research Quarterly, 23, 2, 134-158.) present the
results of a two-year longitudinal study which examined the metalinguistic abilities
of beginning readers. One hundred eighteen Ss were given a variety of tests at
the beginning of first grade, at the end of first grade, and at the end of second
grade. The pre-first-grade measures were (a) three tests of metalinguistic ability
(phonological, syntactic, and pragmatic awareness), (b) three prereading tests
(letter identification, concepts about print, and ready to read word), (c) the
Peabody Picture Vocabulary Test (verbal intelligence), and (d) a test of concrete
operational thought.

The post-first-grade measures consisted of the three tests of metalinguistic
awareness, the three of prereading skills, and three reading achievement tests
(real word decoding, pseudoword decoding, and reading comprehension). The
three reading achievement tests served as the post-second-grade measures.

The authors present a number of analyses, including three path analyses. The
figure reproduced below is the causal path diagram for reading comprehension
at the end of the second grade.



Decide which of the paths you believe could be trimmed. (Remember that trim
decisions are based on both statistical and substantive arguments.) When you
have trimmed these paths, write a causal statement about the relation of the
remaining variables to reading comprehension for the Ss at this grade level. Before
you compare your statement with that of the authors, decide which type of path
analysis was used (one based on regression or one based on loglinear analysis).
Then turn to the article to compare interpretations.

8. C. M. Ely (1986. An analysis of discomfort, risktaking, sociability, and
motivation in the L2 classroom. Language Learning, 36, 1, 1-25.) proposed a model
of language learning which included class participation and ultimately language
proficiency. Seventy-five Ss in six Spanish language classes responded to a
variety of attitude measures reflecting discomfort, risk-taking, and sociability in the
language class, as well as feelings toward the class, importance of the grade, and
strength of motivation. Ss were also observed for classroom participation. These
factors became independent variables. The dependent variable, language
proficiency, was measured by a story retelling task (scored for correctness and fluency)
and a final written exam. Results showed that risk-taking positively predicted
participation. Discomfort negatively predicted risk-taking and sociability.
Finally, participation positively predicted oral correctness. The model is
presented with a path diagram, and a series of tables gives regression results.

If you are interested in classroom research, this study presents a useful model. 
Are these the variables that you would want to include in a model of classroom
learning? Why (not)? What factors would you include and how would you ar¬ 
range them in the model? Look at the variables you have chosen. If this were 
your research, which path analysis would you use (regression-based or loglinear)? 
Why? What other statistical tests might you use for the data? Justify your 
choice. 
Chapter 18

Assumptions of Statistical Tests 


• Reliability of measurement
• Estimating reliability
    Test-retest method
    Parallel test method
    Interrater reliability
    Internal consistency (split-half, K-R 20, K-R 21)
• Validity of measurement
• Validity of research
• Guide to selecting the appropriate statistical procedure
• Assumptions of specific statistical tests


Reliability of Measurement 

Before any statistical procedure is applied to test hypotheses, we assume that the 
researcher has checked to be certain that the measurement of variables is both 
valid and reliable. (In some cases, of course, the research question may be 
whether a particular test is a reliable and valid test.) The reliability of measures 
used in research projects should always be reported. Reliability of the data is an 
assumption behind all statistical procedures, but to satisfy the researcher and the 
reader we must report just how reliable the data are. Even when a test has been 
shown to be a reliable and valid measure in previous research, you will want to 
check to see how reliable it is in your research. That is, just because a test has
been shown to be reliable for one group of Ss does not mean it will (or won't) be
for your Ss.

We have not included a thorough discussion of test reliability and validity in this 
manual since these topics are treated in books on testing and measurement 
(rather than in research design and statistics). We suggest that you consult such 
excellent sources as Henning (1987), Bachman (1990), or Popham (1981) for a 
full discussion of this issue. These authors give careful consideration to all types
of validity: face validity, construct validity, content validity, and both predictive
and concurrent criterion-related validity. Bachman also deals with the grey area
between reliability and validity. However, in case you are not able to find other
resources for estimating reliability, we will present the classical formulas for
reliability in this chapter.
A perfectly reliable measurement is completely accurate, free of error. We
already know that some error will occur; measurement is never perfect. Our goal
is to minimize error in measurement so that the results are true, accurate
measures of performance. Reliability is usually defined as the extent to which a test
produces consistent, accurate results when administered under similar conditions.
Whatever type of data you collect, you trust that they are reliable. If you took
a test on Monday and scored 100 points, you would be surprised if you took the 
test the following day and scored 20. The results would not be consistent or re¬ 
liable. You would have little confidence in the accuracy of the scores obtained. 
You might wonder if the proctors used the wrong correction key. If you asked a 
child to copy a set of words on five different days and the results were radically 
different each day, the results would not be consistent or reliable. Without reli¬ 
able results, there is no point in subjecting the data to any statistical analysis. 
Rather, the data would be described in terms of inconsistent or variable perfor¬ 
mance. 

There are many factors that contribute to unreliable data: measurement error,
fatigue of Ss, problems with the data collection environment, Ss' lack of
familiarity with a particular type of test, and so forth. All of these things cause error
in measurement. Think back again to chapter 1 where we gave you the following
diagrams related to design.

[Figure: diagrams from chapter 1 showing threats to validity]



These figures are helpful in thinking about measurement reliability as well as
validity of design. Obviously, we want our measurement to be accurate, with as
little error as possible. We know that in any distribution, scores spread out
from the point of central tendency. That variability of performance is partly true
variance (individuals place where their true scores fall) and partly error variance
(the scores aren't perfect reflections of true ability and so fall in slightly different
places). A logical definition of reliability is the proportion of variance that is true
variance.
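
The definition above can be written compactly. This is a sketch in the notation of classical test theory, which is an assumption here since the text gives the definition only in words: $s_{true}^2$ and $s_{error}^2$ are labels introduced for the true and error variance, and the two are taken to be independent so that they add up to the observed variance.

$$r_{tt} = \frac{s_{true}^2}{s_{true}^2 + s_{error}^2}$$

When there is no error variance the ratio is 1.00; as error variance grows relative to true variance, the ratio falls toward zero.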

If you think about it for a moment, you will probably be able to predict that we
will use correlational methods to estimate reliability. If we score 100 points
today, we expect to stay in the same relative place in the distribution tomorrow.
We may not score exactly 100 the next day, but if we scored 20, we would be
surprised. We expect a high correlation. If there is high error variance (that is,
if scores do not yield accurate placement in the distribution), our relative
placement will not be the same from moment to moment. The correlation and the
reliability will be low.

From your work on correlation, you know that if the data are not normally
spread throughout the distribution, the correlation will be meaningless. In terms
of reliability, when the test is too easy or too difficult, the scores will clump
together at one or the other end of the scale. The resulting correlation will be
uninterpretable and the reliability low. Conversely, if Ss have a wide range of ability
so that their scores spread out across the range, the correlation and reliability will
be interpretable. However, if everyone has the same ability (e.g., if Ss just took
an intensive program of instruction and the test is referenced to the content of
instruction), the range will be so small that the correlation and reliability of
measurement will be low. This does not mean that the data are unreliable, only
that ordinary correlational methods are not the best methods for establishing the
reliability of the data.

In addition, it is difficult to obtain reliable data from some Ss. Data of preschool
and kindergarten children are particularly variable. Data are often collected at
several different sessions and carefully checked for reliability of performance.
Data from beginner-level second language learners may also be inconsistent.
While researchers should report the reliability of measurement for all studies, it
is especially important that reliability of performance be studied in fields such as
ours.


Estimating Reliability 

In classical theory, there are several ways to estimate reliability: (1) for
consistency over time, correlation between test-retest scores; (2) for equivalence in
form, correlation of parallel or comparable tests; (3) for equivalence in
judgment, interrater reliability checks; and (4) for consistency within a test,
internal consistency estimates.


Test-Retest Method 

Test-retest correlation shows us stability over time. Ss may improve from time 1
to time 2, but they still will be rank-ordered in the same way. The Ss at the top
in the first test will still be at the top in the second, and so forth. (To minimize
the impact of learning/forgetting and maturation, Henning [1987] suggests that
the time lapse should be less than two weeks.) If you give a test twice to the same
Ss (or collect data from the same Ss or texts twice), you can run a Pearson
correlation on the results and report this as the reliability coefficient. An r in the
high .80s or .90s would show that the data are reliable (i.e., consistent and
trustworthy).
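
The test-retest procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two score lists are invented, and `pearson_r` is a helper written for this sketch rather than a function from any statistics package.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2 Ss)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross_products = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    return cross_products / sqrt(ss_x * ss_y)


# Hypothetical scores for eight Ss on two administrations of the same test.
time1 = [55, 62, 70, 48, 81, 66, 59, 74]
time2 = [58, 65, 73, 50, 85, 64, 61, 78]

# The correlation between the two administrations is the reliability estimate.
test_retest_reliability = pearson_r(time1, time2)
print(round(test_retest_reliability, 2))
```

Most Ss in this invented sample improve by a few points at time 2, yet the estimate stays high because their rank order barely changes.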

Possible sources of error variance (and lower reliability) inherent in this method
include setting, time span, history, and the Ss' physical and psychological state.
If the room is hot, ventilation poor, lighting dim, the results may change. The
time span may be too long. If Ss have been given feedback on the results of the
first test, this history factor could change the distribution of scores at time 2. It
may be difficult to obtain the same enthusiasm for the procedure the second time;
some Ss may become bored. These factors should not discourage you from using
this method. Rather, they are potential effects that you should try to minimize.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 18.1 

1. Plann (1979) investigated the acquisition of Spanish gender (agreement of ad¬ 
jectives and articles with nouns) by Anglo elementary school children in an 
immersion program. She devised a puzzle game where children had to request 
puzzle pieces (e.g., los conejos blancos (masculine plural), la mariposa amarilla 
(feminine singular), or other familiar animal and color names) in order to com¬ 
plete their puzzle. Separate scores were tallied for agreement by number and 
gender (masculine and feminine) of each adjective and article. However, a total 
score could be assigned to each child by computing the total scores on these items. 
If you wanted to estimate the reliability of the data, what problems would you 
need to address using the test-retest method?_ 


2. Give two suggestions that might help encourage retest enthusiasm of adult Ss
for a reading test._


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

Parallel Test Method 

Parallel tests are sometimes administered to the same Ss at the same time. If
Ss' scores on these two measures of the same thing correlate in the high .80s or
.90s, we can assume that the data are reliable and trustworthy. (Equating tests
is another field of research in itself.)

There are possible sources of error inherent in this method as well. The two tests
may not be exactly the same. The content in "parallel" tests may differ, the
procedure may differ, or the two tests may require slightly different skills. In
addition, Ss may become fatigued when the two tests are given at the same sitting.
If the tests are "speeded," pressure plus fatigue may result in low test reliability.
Again, the researcher (who is fortunate in having located two parallel tests) must
consider these issues in planning test administration.

ooooooooooooooooooooooooooooooooooooo 

Practice 18.2 

1. Following a reading improvement course, you want a reliable measure of 
reading comprehension. You administer the ESLPE reading comprehension 
subtest (passages show general academic content and questions are multiple 
choice) and the TOEFL reading subtest. If you use the parallel test method to 
estimate reliability, what difficulties would you need to address?_ 


ooooooooooooooooooooooooooooooooooooo 

Interrater Reliability 

In research reports you will frequently find figures given for interrater reliability 
where several judges have been asked to rate compositions or, perhaps, the 
speaking performance of language learners. To have confidence in the ratings, 
we need information on interrater reliability. Since the S's final score is the
combination or average of the ratings, reliability depends on the number of
raters. (The more raters, the more we trust the ratings, and so the higher the
reliability.)

To compute interrater reliability for more than two raters, we first correlate all 
the ratings (producing a Pearson correlation matrix) and then derive an average 
of all the correlation coefficients. You'll remember in the chapter on correlation, 
we used Spearman correlation for rank-ordered data. Here we have used the 
Pearson. In order to correct the distortion inherent in using the Pearson for 
ordinal data, we apply a Fisher Z transformation. (If the original correlations 
are very similar or are very close to .50, there is very little distortion. In such 
cases, it is not necessary to correct for attenuation.) We won't give you the for¬ 
mula for this since we can supply a conversion table (see appendix C, table 10). 
Once you have obtained the correlations, check the table and convert the values 
before placing them in the formula below. 

$$r_{tt} = \frac{n \, r_{AB}}{1 + (n - 1) \, r_{AB}}$$

In this formula, $r_{tt}$ stands for the reliability of all the judges' ratings, n is the
number of raters, and $r_{AB}$ is the correlation between the two raters (if there are
only two) or the average correlation if there are more than two judges.
Let's see how this works. Riggenbach (1989) asked 12 ESL teachers to rate the
fluency of 6 NNS (taped during conversations with native-speaker friends). Let's 
assume that we had collected similar data on 16 ESL students and asked 6 ESL 
teachers to rate them for fluency. Here is the Pearson correlation matrix: 



Pearson Correlation for Raters

        R1     R2     R3     R4     R5     R6
R1     1.00    .69    .87    .52    .75    .74
R2            1.00    .80    .62    .57    .77
R3                   1.00    .79    .81    .83
R4                          1.00    .67    .61
R5                                 1.00    .67
R6                                        1.00

The correlations must now be corrected using the Fisher Z transformation table.


Z Transformation for Data

        R1     R2     R3     R4     R5     R6
R1            .85   1.33    .58    .97    .95
R2                  1.10    .72    .65   1.02
R3                         1.07   1.13   1.19
R4                                 .81    .71
R5                                        .81
R6





The average for these correlations would be:

$$\bar{X} = \frac{13.89}{15} = .926$$

We can now use these corrected values to obtain the overall reliability.

$$r_{tt} = \frac{n \, r_{AB}}{1 + (n - 1) \, r_{AB}} = \frac{6(.926)}{1 + (5 \times .926)} = .987$$

Having used the Z transformation as a correction, we now must change this value
back to that of a Pearson correlation. We turn once again to the chart and find
that .987 equals a Pearson correlation between .75 and .76. The interrater
reliability is marginal. However, if these ratings of fluency were given without
setting any predetermined definitions of fluency (hoping there would be a strong
shared notion of fluency that would result in better agreement), such a correlation
is not unusual. With a predetermined definition of fluency and with a training
session for the raters, a much higher correlation would result.
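
The steps just described (transform each correlation, average, adjust for the number of raters, convert back) can be sketched in Python. The sketch assumes that the conversion table in the appendix is the usual Fisher Z table, so it uses `math.atanh` and `math.tanh` in place of a table lookup. The function name is invented for illustration, and small differences from hand calculation can arise because the table values are rounded to two decimals.

```python
from math import atanh, tanh


def interrater_reliability(pairwise_r, n_raters):
    """Fisher Z, average, adjust for number of raters, convert back to r."""
    z_values = [atanh(r) for r in pairwise_r]      # Fisher Z transformation
    mean_z = sum(z_values) / len(z_values)         # average corrected value
    adjusted = (n_raters * mean_z) / (1 + (n_raters - 1) * mean_z)
    return tanh(adjusted)                          # back to the Pearson scale


# Upper triangle of the six-rater Pearson correlation matrix from the example.
pairwise_r = [
    .69, .87, .52, .75, .74,   # R1 with R2 to R6
    .80, .62, .57, .77,        # R2 with R3 to R6
    .79, .81, .83,             # R3 with R4 to R6
    .67, .61,                  # R4 with R5 and R6
    .67,                       # R5 with R6
]

reliability = interrater_reliability(pairwise_r, n_raters=6)
print(round(reliability, 2))
```

Run on the matrix from the example, the result falls between .75 and .76, in line with the table lookup described in the text.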
Since it is likely that you may need to calculate interrater reliability for ratings,
let's practice this procedure.

OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

Practice 18.3 


► 1. Riggenbach also had teachers rate the fluency of the learners on their
monologue performance. Here are fictitious ratings for fluency. Calculate the
interrater reliability.


        1      2      3      4      5      6
1             .47    .57    .44    .68    .68
2                    .40    .55    .67    .69
3                           .56    .48    .78
4                                  .53    .62
5                                         .65

The Pearson interrater reliability with Z transformation was ________. We can,
therefore, conclude ________


2. If your data consist of ratings obtained from judges, compute the interrater 
reliability and report the results below. 


ooooooooooooooooooooooooooooooooooooo 

Internal Consistency 

Internal consistency methods are used when it is not convenient to collect data
twice or to use parallel tests. There are three classic methods for calculating
reliability from the internal consistency of a test: split-half method,
Kuder-Richardson 20, and Kuder-Richardson 21.

For the split-half method, the test data are divided into two similar parts. The 
scores of the Ss on the two halves are treated as though they came from two
separate tests. If the test items are homogeneous, you might put the even- 
numbered items in one test and the odd-numbered in the second. If the test is 
not homogeneous, be sure to first match items that test the same thing, and then 
randomly assign one of each pair to one test and the other to the second test. 
The correlation between the two halves gives us the reliability for 1/2 of the test. 
When we have the Pearson r for the half-test, the Spearman-Brown Prophecy 
formula determines the reliability of the full test: 

$$r_{tt} = \frac{2 \, r_{AB}}{1 + r_{AB}}$$

$r_{tt}$ is the reliability of the total test. $r_{AB}$ is the correlation of the half-tests, which
we multiply by 2. This value is then corrected by dividing by 1 plus the
correlation of the two half-tests.

Imagine that you gave the Peabody Picture Vocabulary Test to two groups of
bilingual children (Spanish/English and Korean/English). The research question
is whether the scores of the two groups differ. However, prior to answering this,
it is important to establish reliability of measurement. To do this, the responses
of all the children to half of the test items are assigned to one "test" and their
responses to the other half to a second "test." Imagine that the obtained correlation
between the two halves was .87. This is the reliability of the half-test. The
Spearman-Brown Prophecy formula gives us the reliability of the total test:

$$r_{tt} = \frac{2 \, r_{AB}}{1 + r_{AB}} = \frac{2 \times .87}{1 + .87} = \frac{1.74}{1.87} = .93$$

The correlation shows us that the test data are reliable. We can feel comfortable 
about continuing on with our investigation of the research question. 
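
The split-half procedure can be sketched as follows. The response matrix is invented for illustration (six Ss, eight right/wrong items), the odd-even split assumes the items are homogeneous, and the helper names are hypothetical rather than taken from any package.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross_products = sum(a * b for a, b in zip(dev_x, dev_y))
    return cross_products / sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))


def spearman_brown_full(r_half):
    """Step up a half-test correlation to the reliability of the full test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per S; items are split odd/even."""
    odd_totals = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown_full(r_half)


# Hypothetical right (1) / wrong (0) responses: six Ss, eight items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(round(r_half, 2), round(r_full, 2))

# The worked example from the text: a half-test correlation of .87.
print(round(spearman_brown_full(.87), 2))
```

The last line reproduces the .93 obtained by hand for the half-test correlation of .87.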

However, let's assume that the correlation shows us that the data are not accurate 
(i.e., not consistent). What can we do? Remember that reliability is derived in 
part by the number of items or, in interrater reliability, the number of judges. If 
we add similar items, or additional judges, the reliability should increase. The 
question is how much more data are needed. (Notice that the total test in the 
example above has more items and that the reliability estimate improves from the
half-test to the total test.) We can use the Spearman-Brown prophecy formula 
to determine this. Henning (1987) offers the following Spearman formula to al¬ 
low the researcher to relate test length, or number of raters, to reliability. 

$$r_{nn} = \frac{n \, r_{tt}}{1 + (n - 1) \, r_{tt}}$$

Here, $r_{nn}$ is the reliability of the test when adjusted to n times its original length.
$r_{tt}$ is the observed reliability at the present length. n is the number of times that
the length is augmented. For example, if you had used 6 raters and wanted to
increase reliability by adding 2 more raters, the n would be 1.33. If you had 40
test items and wanted to add 10 to increase reliability, the n would be 1.25.

To determine the optimal number of items, or raters, to reach the level of reli¬ 
ability you have selected as a goal, use the following formula: 

$$n = \frac{r_{ttd}(1 - r_{tt})}{r_{tt}(1 - r_{ttd})}$$

n is the number of times that the test must be increased with similar items or
raters. $r_{ttd}$ is the desired level of reliability. $r_{tt}$ is the present level of reliability or
correlation between two raters or sets of items.

As an example, assume the Riggenbach data for interrater reliability was not high 
enough to justify confidence. Imagine that there were three raters and the 
interrater reliability were .70. The desired reliability is .80. 

$$n = \frac{.80(1 - .70)}{.70(1 - .80)} = 1.71$$

This tells us that we must increase the number of raters 1.71 times, from 3 to
about 5. Adding two raters brings the estimated reliability to approximately .80
(about .795 with five raters); a third additional rater would raise it to about .82.
Of course, we could also improve reliability by giving the raters additional
training in the rating procedure.
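
Both uses of the Spearman-Brown formula, predicting reliability at a new length and finding the length needed for a target reliability, can be sketched together. The function names are invented for this illustration, and rounding the number of raters up to the next whole rater is an assumption of the sketch rather than a rule stated in the text.

```python
from math import ceil


def reliability_at_length(r_tt, n):
    """Reliability when the test (or rater panel) is made n times as long."""
    return n * r_tt / (1 + (n - 1) * r_tt)


def lengthening_factor(r_tt, r_desired):
    """How many times longer the test must be to reach the desired reliability."""
    return r_desired * (1 - r_tt) / (r_tt * (1 - r_desired))


# The example from the text: three raters, observed .70, desired .80.
n = lengthening_factor(0.70, 0.80)
print(round(n, 2))

# Whole raters only: round 3 x 1.71 up to be sure the target is reached.
raters_needed = ceil(3 * n)
print(raters_needed)

# Compare the estimates for five raters (two added) and six raters (three added).
five_raters = reliability_at_length(0.70, 5 / 3)
six_raters = reliability_at_length(0.70, 6 / 3)
print(round(five_raters, 3), round(six_raters, 3))
```

Three raters times 1.71 is 5.14 raters. Five raters give an estimate just under .80 and six give one just over it, so whether to add two or three raters depends on how strictly the target is held.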


ooooooooooooooooooooooooooooooooooooo 

Practice 18.4 

1. Check the reliability for the monologue data. Set a slightly higher reliability 
level as your goal. How many raters would you need to add?


2. Assume that the split-half reliability in the Plann study (page 532) was .75. 
What level of reliability would you hope to achieve for such data? What mea¬ 
sures could you take to increase the reliability?_ 


OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 

A second way to estimate reliability via internal consistency is to use the Kuder- 
Richardson 20 formula. You can think of K-R 20 as giving an average item reli¬ 
ability that has been adjusted for the number of items on the test. (Note: K-R
20 is equivalent to Cronbach's alpha coefficient in the case of 0-1, right vs. wrong
data. Cronbach's alpha can also be applied to ordinal scale data whereas K-R 20
cannot. For a real example, you might look at the article by Yopp (1988) which
presents Cronbach's alpha values for 10 tests of phonemic awareness.) Since we
don't want to have to run correlations on every pair of items in a test, we can 
shortcut the process by estimating reliability using the following formula: 
The formula for K-R 20 is:

$$\text{K-R 20} = \frac{n}{n - 1}\left[\frac{s_t^2 - \sum s_i^2}{s_t^2}\right]$$

Here K-R 20 is the reliability of the test, n is the number of items, $s_t^2$ is the
variance of the test scores, and $\sum s_i^2$ is the sum of the variances of all items.


Imagine that you gave a 9-item test (n in the formula) to a group of Ss. The
variance of the test scores was 8.00 (the $s_t^2$ in the formula), and the sum of the
variances of all the items was 1.72 (the $\sum s_i^2$ in the formula). So, the K-R
20 for the test would be:

$$\text{K-R 20} = \frac{9}{8}\left[\frac{8.00 - 1.72}{8.00}\right] = .883$$
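
A short Python sketch of K-R 20 follows. It makes two assumptions that the text does not spell out: variances are computed with N (not N - 1) in the denominator, so that each right/wrong item's variance equals p(1 - p), and the response matrix is invented for illustration. The function names are hypothetical.

```python
from statistics import pvariance


def kr20_from_summary(n_items, total_variance, sum_item_variances):
    """K-R 20 from the number of items and the two variance terms."""
    return (n_items / (n_items - 1)) * (
        (total_variance - sum_item_variances) / total_variance
    )


def kr20(item_scores):
    """K-R 20 from raw 0/1 responses; item_scores holds one list per S."""
    n_items = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    item_variances = [
        pvariance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    return kr20_from_summary(n_items, pvariance(totals), sum(item_variances))


# The worked example: 9 items, test variance 8.00, summed item variances 1.72.
print(round(kr20_from_summary(9, 8.00, 1.72), 3))

# Hypothetical right (1) / wrong (0) responses: six Ss, eight items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
print(round(kr20(responses), 2))
```

The first call reproduces the .883 of the worked example; the second shows the same calculation starting from raw responses instead of summary figures.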


The K-R 21 formula is also frequently reported in applied linguistics literature.
For example, Ilyin, Spurling, and Seymour (1987) used K-R 21 figures to report
the reliability of six ESL tests. The Kuder-Richardson 21 formula is even simpler
to compute. In this formula, n is the number of items in the test, $\bar{X}$ is the mean
of the test scores, and $s^2$ is the variance of the scores in the sample.


Imagine that you gave a 100-item grammar test to your students. The test
developers have reported high reliability for this test in its brochures. The $\bar{X}$ for
your group was 65, and the s was 10. With this information, it is easy to compute
the reliability of the test when applied to your Ss.

$$\text{K-R 21} = \frac{n}{n - 1}\left[1 - \frac{\bar{X} - \bar{X}^2/n}{s^2}\right]$$

$$\text{K-R 21} = \frac{100}{99}\left[1 - \frac{65 - 65^2/100}{10^2}\right]$$

$$\text{K-R 21} = 1.01\left[1 - \frac{65 - 42.25}{100}\right]$$

$$\text{K-R 21} = .780$$

For these fictitious data, the reliability for your group is noticeably lower than
the high figures reported in the brochures, which is why reliability should be
checked for your own Ss.
Henning (1987) notes that K-R 21 is less accurate than K-R 20 as it slightly
underestimates the actual reliability. Both K-R formulas are especially prone to
distorting reliability when the number of items is small. In this example, the
number of items (100) is large, so the distortion, if any, should be small.
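
K-R 21 needs only three summary figures, so a sketch of it is very short. The function below assumes the standard form of the formula (the mean minus the squared mean over n, divided by the variance); the function name and the sample values are hypothetical and are not taken from any study cited here.

```python
def kr21(n_items, mean_score, variance):
    """K-R 21 from the number of items, the mean, and the variance of scores."""
    return (n_items / (n_items - 1)) * (
        1 - (mean_score - mean_score ** 2 / n_items) / variance
    )


# Hypothetical 50-item test: mean 30, standard deviation 8 (variance 64).
estimate = kr21(n_items=50, mean_score=30.0, variance=8.0 ** 2)
print(round(estimate, 3))
```

Note that the function takes the variance, not the standard deviation, so a reported s must be squared before it is passed in.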

The different ways of estimating reliability are obviously not exactly the same.
If the items on the test do measure the same ability or abilities throughout the
test, then it makes sense to use one of the internal consistency measures (even
though these are less valued measures of reliability than test-retest or parallel test
methods).

In our discussion of methods of estimating reliability we have mentioned a
number of characteristics of the test itself which may influence reliability. Let's
summarize them here.

1. The longer the test, the more reliable it will be (assuming it's not so long that
fatigue sets in!).

2. If the task items are too easy or too difficult, reliability will be low (scores
will all group together at high or low points). Consult your advisor for
alternate ways of estimating reliability.

3. Items which discriminate well among Ss will contribute to test reliability.

4. If Ss have a wide range of ability, test reliability will increase.

5. The more homogeneous the items, the higher the reliability. 

6. Reliability estimates are better on "power" tests than on "speeded" tests (don't 
use a split-half method comparing the first half with the last half if speed is 
a factor!). 

Test length is probably the most important test characteristic in terms of
increasing reliability. The more data you have, for example, on your Ss'
performance on present perfect, the surer you can be that the performance is reliable.
Of course, an increase in number of items will result in improved reliability only
up to a point (which you can determine using the Spearman formula); from then
on, the reliability will change very little as more items are added. It also makes
no sense to add items if it means that Ss become bored or exhausted. Trying to
increase reliability by constructing many items may not work. When testing young
children, or when it is likely that Ss will become frustrated by a task, it is always a
good idea to schedule plenty of "rest periods" to be sure that you get consistent,
high-level performance.

One final note on reliability of data: the reliability formulas given here were
devised for data where there is variance. They obviously should not be used with
criterion-referenced tests (since these tests are devised to get high scores from all
Ss and, thus, little variance in performance). If the data you have collected come
from a criterion-referenced test, consult Thorndike (1971). Hudson and Lynch
(1984) provide an excellent discussion of ways to estimate reliability and validity
of such tests. The article contains a good example of application of these
techniques to a sample data set.


Validity of Measurement 

Researchers know that if a test measure is not reliable it cannot be a valid
measure. However, validity is almost always couched in terms of valid for something
else. An instrument is valid for the purpose of testing reading, a test is valid for
some certain group of advanced learners, or it is a valid test for children.

Nevertheless, there are different types of validity, three of which are most central
to applied linguistics research. The first of these is content validity. Content
validity represents our judgment regarding how representative and comprehensive
a test is. Our data-gathering procedures are usually selective. That is, we can't
look at all of written language in doing a text analysis. We select a representative
sample (say, comparative structures) in a written language data base. In a
grammar test, we can't test every grammar point; there isn't time. We select a
representative sample for test purposes. If we study speaker overlap in natural
conversations, we can't look at every overlap in all conversations of learners.
We have to select a representative sample of the data for analysis. Not only
should the data we select for analysis be representative of the phenomenon we
wish to research, but they must be comprehensive. Thus, in the grammar
example, the items selected for test purposes must not only be representative but cover
as wide a range of grammar points as possible. A grammar test that only tests
verb morphology is not a valid grammar test in terms of the comprehensiveness
of its content. If the data selected for overlaps relate only to greetings and
leave-takings, the measurement is not comprehensive and lacks content validity
(if the test purports to measure overlaps in conversation in general). In the same
vein, if a text analysis of comparatives only selects samples related to more and
most and neglects the -er, -est forms, the data are not comprehensive and so the
measures lack content validity.

Content validity, then, has to do with how well a test or observation instrument 
tests what it purports to test. The key elements in the judgment are 
representativeness and comprehensiveness. There is no statistical measurement
of content validity. Rather, panels of experts may be asked to rate the 
representativeness and comprehensiveness of each part of a test. They may be 
asked to draw up specifications so that test items can be written that will be 
representative and comprehensive in nature. 

Face validity relates to content validity. However, face validity has more to do
with how easy it will be to convince our Ss, our peers, and other researchers that
a particular test actually measures what we say it measures. For example, it may
be difficult to convince some teachers or administrators of the face validity of a
cloze passage test as a measure of reading comprehension. Somehow, since it is
not in the traditional form of "read this passage and answer these questions based
on the passage," it does not have the same acceptance, the same face validity as
an ordinary reading test.

Predictive validity refers to the use of tests as valid for predictive purposes.
If a student does well on a general language proficiency test, we expect that we
can predict how well the student will do in a variety of nontest situations.
Conversely, we expect that if a student does poorly on a language proficiency
test, this can be used to predict nonsuccess in other areas. For example, you may
have noticed that students applying to foreign universities are asked to give
information on their abilities in the language of instruction at the university.
People who write reference letters are often asked about the language proficiency
of students wishing to study in foreign universities. The assumption is that
language proficiency can predict at least to some degree success in academic life
in general. Researchers have been interested in how well ESL placement exams
(e.g., TOEFL or ESLPE) can predict grades (GPA) of entering foreign students. If
the ESL tests have good predictive validity, we would expect that they might be
very good predictors at the lower proficiency levels. Henning (1987) suggests that
as students become more and more proficient, the relationship between test scores
and GPA should weaken. That is, hypotheses on the predictive validity of the
ESL test in terms of general academic performance could be made and tested.
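
One way to test such a hypothesis is to correlate placement scores with GPA
separately for lower- and higher-proficiency students and compare the two
coefficients. The following is only a sketch in Python: the scores, the GPAs, and
the cut score of 500 are invented for illustration, and the function names are
our own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_by_band(scores, gpas, cut):
    """Correlate test score with GPA below, and at or above, a cut score."""
    low = [(s, g) for s, g in zip(scores, gpas) if s < cut]
    high = [(s, g) for s, g in zip(scores, gpas) if s >= cut]
    return pearson_r(*zip(*low)), pearson_r(*zip(*high))

# Hypothetical placement scores and first-year GPAs (not real data).
scores = [400, 420, 450, 470, 490, 510, 540, 560, 580, 600]
gpas = [1.8, 2.0, 2.4, 2.6, 2.9, 3.4, 3.1, 3.5, 3.0, 3.3]

low_r, high_r = r_by_band(scores, gpas, cut=500)
print(f"lower band r = {low_r:.2f}, higher band r = {high_r:.2f}")
```

With real data, each band would need enough Ss to meet the assumptions of the
Pearson correlation, and a restricted range of scores within a band will itself
lower the coefficient.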

A third type of validity has been mentioned several times in this manual:
construct validity. We have demonstrated that there are many constructs in our
field for which we have no direct measures: constructs such as motivation, need
achievement, attitude, role identification, acculturation, communicative
competence, and so forth. We can, however, say that someone who is highly
motivated in language learning will exhibit traits a, b, and c for which we may
have direct measures. The motivated S will not exhibit traits x, y, and z for
which we have direct measures. Once we have formed our hypotheses regarding the
construct, we can then test these predictions in our research. If our predictions
hold, we can say that we have a valid construct which was defined by theory and
tested in our research.

Another kind of construct validation is more closely tied to second language
acquisition research. In a series of theoretical papers, SLA researchers have
suggested that there are a number of components that make up "communicative
competence." If there are direct measures for the components, then we should
be able to test the construct indirectly. In addition, if it can be shown that a
person who is competent exhibits traits a, b, and c and not x, y, and z, then we
can begin to establish the reality of the construct itself. You might think of this
as analogous to establishing the validity of a construct such as listening
comprehension. If the items on a listening subtest correlate (point biserial
correlation) with the listening subtest score better than, say, with a vocabulary
subtest score, then we have helped to establish the construct validity of listening
comprehension. In the same way, construct validity can be established for many
constructs for which we have no direct measures. Since there are so many such
constructs related to language learning, construct validation is important to our
field.
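
The listening-item comparison just described can be sketched in a few lines of
Python. The point biserial formula used here is the standard one (the difference
between the mean scores of those who passed and failed the item, divided by the
standard deviation and multiplied by the square root of pq). The item responses
and subtest scores are invented for illustration.

```python
from math import sqrt

def point_biserial(item, scores):
    """Point biserial r between a 0/1 item and a set of continuous scores."""
    n = len(item)
    passed = [s for i, s in zip(item, scores) if i == 1]
    failed = [s for i, s in zip(item, scores) if i == 0]
    p = len(passed) / n          # proportion answering the item correctly
    q = 1 - p
    mean_all = sum(scores) / n
    # Population standard deviation (dividing by n) matches this formula.
    sd = sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    mean_p = sum(passed) / len(passed)
    mean_q = sum(failed) / len(failed)
    return (mean_p - mean_q) / sd * sqrt(p * q)

# Hypothetical data for eight Ss on one listening item.
item = [1, 1, 1, 1, 0, 0, 0, 0]
listening = [18, 17, 16, 15, 12, 11, 10, 9]
vocabulary = [14, 9, 15, 10, 13, 8, 12, 11]

r_listening = point_biserial(item, listening)
r_vocabulary = point_biserial(item, vocabulary)
print(f"item with listening subtest:  {r_listening:.2f}")
print(f"item with vocabulary subtest: {r_vocabulary:.2f}")
```

In practice the item's own contribution should be removed from the listening
subtest total before correlating, so that the coefficient is not inflated by a
part-to-whole overlap.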

As was the case with reliability, there are many threats to measurement validity. 
Henning lists five major threats to validity. In most cases, it takes only a few 
minutes of planning to avoid these problems. 

1. Invalid application of tests. For example, you might select a Spanish reading 
test to assess the L1 reading skills of a group of Spanish-English bilingual
children in the American Southwest. Later, someone points out that the 
Spanish test was written in Spain with cultural content related to that coun¬ 
try. Most of the children come from Mexico and Central America—the test 
has been misapplied. There are many possibilities for misapplication of tests: 
tests written for native speakers given to immigrant students, tests written for 
adults administered to children, vocabulary tests given as tests of reading, 
and so forth. The tests are not valid for these particular purposes or groups.

2. Inappropriate content. Students may not know the vocabulary used in test
items meant to test other skills. For example, students may be given a reading
test where the passages are about physical sciences even though the students
have never been exposed to such content. The content is not valid for the
purpose of testing reading.

3. Lack of cooperation from the examinees. When the people who contribute the 
data have no stake in the data, it's likely the data will not be valid. If the 
participants view the exercise as a great waste of time, and you hope to use 
the results to decide who should or should not be selected for a special lan¬ 
guage program, the data will not be valid for your purposes. 

4. Inappropriate norming population. It's possible that you will see a test
referenced as both valid and reliable. For example, UCLA uses a placement
examination which is believed to be valid and for which we have obtained
consistently high reliability. The Ss who take this test are from many different
countries but primarily from Asian countries. These Ss may be newly
arrived students or students who have lived in the United States for several
years prior to admission to the university. Imagine, then, taking the test
normed on this group (diverse in some respects) and using it as a valid
placement test at the University of Copenhagen. We have no way of knowing
whether the test would be valid for placement purposes in a different area of
the world where all the Ss are from one language background.

5. Invalid constructs. We have talked about this problem many times in this 
chapter and elsewhere. We need to be careful in our operational definitions 
of constructs such as motivation, success, fluency, intelligibility, and so forth. 
We must be able to show that our measures are valid as tests of these con¬ 
structs. 


Validity of Research 

Whether planning our own research or reading research reports of the field, we 
should think about the project from at least two perspectives. First, what can this 
particular piece of research contribute to our understanding of language acquisi¬ 
tion theory or language teaching practice? The potential for advancing theory or 
practice is our main guideline in evaluating research. That potential, however, 
can be reached only if the research is carried out in a way that allows us to feel 
confident in the results. 

The issue of confidence has been central, lurking behind all the discussions and 
problem solving activities in this manual. We want to carry out our research in 
such a way that we all will feel confident in claims made on the basis of the re¬ 
sults. This depends on the answer to at least three questions: 

1. Is the design valid? 

a. Internal validity of design: What threats are left unsolved?

b. External validity of design: What threats are left unsolved (especially
stratified sampling, random selection, and random assignment)?

2. Is the measurement valid? 

a. Do operational definitions match the construct? 

b. Are the measures representative and inclusive? 

3. Is the measurement reliable? 

a. Is the measurement internally consistent (split-half, K-R 20, K-R 21)? 

b. Is the measurement consistent over time (correlation)?

c. Is there consistency in form (correlation)?

d. Is there consistency in judgment (interrater reliability)? 

Our ability to describe and, beyond description, to generalize depends in large 
part on the answers to these questions. 
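
The internal consistency check in question 3a can be made concrete. The sketch
below, in Python, computes K-R 20 from a small matrix of right/wrong item
responses and K-R 21 from the number of items, the mean, and the variance of the
total scores. The response matrix is invented for illustration, and the use of
the population variance (dividing by n) is an assumption; some sources divide by
n - 1 instead, which changes the values slightly.

```python
def kr20(responses):
    """K-R 20 from 0/1 item responses (rows are Ss, columns are items)."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n  # population variance
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion passing item j
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / variance)

def kr21(k, mean, variance):
    """K-R 21 from number of items, mean, and variance of total scores."""
    return k / (k - 1) * (1 - mean * (k - mean) / (k * variance))

# Hypothetical responses of five Ss to a four-item test.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"K-R 20 = {kr20(responses):.2f}")
print(f"K-R 21 = {kr21(k=4, mean=2.0, variance=2.0):.2f}")
```

K-R 21 treats all items as equally difficult, so it is never higher than K-R 20
for the same data; the gap between the two grows as item difficulties diverge.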

We have already said that computers can never judge data; they know not
whence the data came. The same can be said about any statistical test. Once
we move over to selection of a statistical test to test our hypotheses, we assume
that the questions regarding the source of the data have been satisfactorily
answered. All statistical tests share the assumption that the data are both valid
and reliable. There are, however, specific assumptions related to each procedure.
We have talked about these as we discussed each procedure in turn. In order to
help you select the appropriate procedure and to check the assumptions of each
test, we present a flow chart and a list of assumptions in the following sections.

Guide to Selecting the Appropriate Procedure

Once you have determined reliability of your data, you still must select the
appropriate procedure for analyzing the data. The assumptions listed for each
procedure may help you determine the appropriate statistical procedure. At
several points throughout the manual we have given you pointers or asked that you
draw your own chart to help guide you in the selection process.

The following flow chart includes the statistical procedures covered in the
manual. With care, the chart may be used as a guide in the selection process.
Once a statistical procedure has been selected, check the assumptions for that
procedure in the following lists.

How will you interpret the results?

Check that the coefficient of scalability is over .60
    Interpret scale in light of reasonableness of the cutoff point, number of
    instances required, missing values, and context used to elicit forms.

Compare the groups. Check z value
    If the z score is significant, then one group has more Ss in higher ranks
    than the other. Use eta² for strength of association.

Compare the groups. Check H value
    If H is significant, the groups differ; use Ryan's procedure to locate
    which groups differ. Use eta² for strength of association.

Compare the groups. Check R or z value
    If the Sign test R or Wilcoxon z is significant, there is a change from
    time 1 to time 2. Use eta² for strength of association for Wilcoxon.

Compare the groups. Check χ²
    If χ² is significant, there is a change over several time points (or
    measures). To locate the difference more precisely, report the results of
    Nemenyi's procedure. Use eta² for strength of association.

Compare the two means. Check t value
    If the t value is significant, the two groups differ. Use eta² to show
    strength of association.

Compare 3+ means. Check F ratio
    If the F ratio is significant, the groups differ. To locate the difference
    more precisely, interpret the multiple-range test (Scheffé, Tukey, or
    Newman-Keuls). Use omega² or eta² for strength of association.

Compare the two means. Check t value
    If the t value is significant, there is a difference in the means for the
    two times or measures. Use eta² for strength of association.

Compare 3+ means. Check F ratio
    If the F ratio is significant, the same (or matched) Ss perform differently
    on repeated measures. Use a multiple-range test to locate precise
    differences. Use eta² for strength of association.

    Step 1. If the interaction is significant, chart the means to show the
    interaction and interpret it. Interpret main effects in light of the
    interaction.
    Step 2. If the interaction is not significant, interpret the difference in
    the main effects. Use a multiple-range test to locate precise differences.
    Use eta² to show the strength of association.

Check the strength of each correlation
    r² shows the amount of overlap between each pair of variables. Be sure to
    correct for attenuation if measures are not of equal reliability.

Check the probability of the correlation
    If the correlation is significant, it shows that the H0 of no relation can
    be rejected. Interpret the value "sensibly" in terms of strength of
    relationship.

Check the value of the ...
    Explain the correlation in a "sensible" way.

Check χ² for significance
    Explain the correlation in a "sensible" way.

Report predicted scores. Check the SEE
    The stronger the correlation and the smaller the SEE, the better the
    prediction will be.

Check each added variable
    Identify the first independent variable, then the overlap of the second
    with the first to see how much each contributes (as well as their joint
    contribution) to explain variance in the dependent variable. Explain how
    much additional information is given by each succeeding independent
    variable.

Check each factor loading
    If possible, once the number of factors has been determined, label each
    factor by consulting variables with high loadings vs. variables with low
    loadings on each. Else, label them as factor A, B, C, etc.

Check solutions and stress
    Once a solution (about number of dimensions) has been identified or
    selected, label each dimension by consulting items in the cluster and those
    distant from cluster. Else, label them as dimension A, B, C, etc.

Check χ² value & (O-E)²/E
    If χ² is significant, the distribution differs from the expected
    distribution. Show which cells differ most from expected cell frequency or
    do a Ryan's procedure to locate the difference more precisely. Use Phi or
    Cramer's V for strength of association.

Check z value
    If z is significant, conclude there is a change in proportion of Ss from
    time 1 to time 2.

Check parameter estimates to reduce model, compare models
    The parameter estimates show which interactions and main effects are
    significant. To "pare" the model, compare various models with the saturated
    model. Decisions should be based on statistical and substantive arguments.

Check the paths to see which can be trimmed from the model
    Use the analysis to trim "paths" from the model. Interpret the findings on
    both statistical and substantive grounds.


Assumptions of Specific Statistical Tests


All statistical tests have certain assumptions underlying their formulation. When 
the assumptions are violated, it is difficult for the researcher or the reader of the 
research to know how much confidence to place in the findings. If one assump¬ 
tion is ignored and the statistical test applied, the probability is distorted. If more 
than one assumption is violated, or some combination of assumptions is violated, 
the resulting probability level is meaningless. 

The kinds of assumptions differ depending on the kinds of claims researchers 
hope to make. When the researcher uses a statistical test for inferential 
purposes--i.e., when the researcher wishes to generalize--then the basic assump¬ 
tions of the test must be met, and, in addition, good design must be employed. 

For example, in using a statistical study for inferential purposes, we make the
following assumptions: 

1. The population is well described and a representative sample is drawn from 
the population. 

2. A (stratified) random sample is drawn from the population. 

3. Ss (or observations) are randomly assigned to treatment groups.

4. Threats to internal validity of the study have been met.

5. Threats to external validity of the study have been met.

In our field, we seldom are able to meet all of these requirements. Therefore,
when parametric statistical procedures are used, they are used primarily for
descriptive purposes. They give us confidence that what we say about the sample
is correct (rather than confidence that what we say about the sample can be
generalized to the population at large). In using parametric procedures for
descriptive purposes, we also do not make causal claims. Even for inferential
purposes, we cannot say that we have "proved" anything, but rather, that we have
found evidence in support of (or contrary to) our hypotheses.

Nevertheless, whether we use statistical procedures for inferential purposes or 
descriptive purposes, we must meet the assumptions that underlie the statistical 
test. In some cases, this relates to normal distribution; in other cases, it may be 
the number of observations or Ss required; and in other instances, it may have 
to do with whether the observations are from different Ss or from the same Ss 
at different times. 

Once you have selected a statistical procedure, you should always check the as¬ 
sumptions that underlie the test. In general, many tests can suffer from a small 
n size because a single outlier or a few outliers will influence the results. That is 
why n size appears again and again in the suggestions below. If you cannot meet 
the assumptions of a statistical test, then it is better not to apply the test. If you 
go ahead and run the procedure and report the results, the results must be
qualified so that readers can make up their own minds whether they feel the
circumstances warrant the exceptional use of the test.

Since we want to use as much information in the data as possible, our first choice 
is for a parametric procedure. When we cannot meet the assumptions of these 
procedures, however, we can turn to alternative nonparametric procedures. The
following lists reflect this bias. Please consult the appropriate chapter for a full 
discussion of the assumptions underlying each test. 


t-test: Comparing Two Means 

1. The data are scores (or ordinal scale data appropriate for parametric proce¬ 
dures). 

2. The data are independent (between-groups design). 

3. The data in the two samples are normally distributed so that the mean (X̄)
is an appropriate measure of central tendency.

4. The distribution of the data in the respective populations is normally distrib¬ 
uted with equal variances. 

5. Only two means are compared (no cross-comparisons allowed). 

6. The t statistic shows if an effect exists, not its strength. 
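
Assumption 6 is easy to see in a small computation. The sketch below, in Python,
computes a pooled-variance t for two independent groups and then converts it to
eta² with the common formula t² / (t² + df). The scores are invented, the
function names are our own, and the code assumes that the other assumptions
listed above have already been checked.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group1, group2):
    """Pooled-variance t for two independent groups; returns (t, df)."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # variance() is the sample variance (n - 1 in the denominator).
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df

def eta_squared(t, df):
    """Strength of association for a t-test: t squared over (t squared + df)."""
    return t ** 2 / (t ** 2 + df)

# Hypothetical final exam scores for two independent classes.
method_a = [78, 85, 69, 91, 74, 88, 80, 83]
method_b = [71, 65, 80, 62, 77, 68, 70, 73]

t, df = independent_t(method_a, method_b)
print(f"t = {t:.2f}, df = {df}, eta squared = {eta_squared(t, df):.2f}")
```

The t value is checked against the critical value for its df to decide whether
an effect exists; eta² then says how much of the variance in the scores is
accounted for by group membership.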


Solutions when assumptions cannot be met. 

1. For assumption 1 (and 3 and 4), use a Rank sums test or a Median test
instead. (If you have 40 or more Ss, you can also use the Kolmogorov-Smirnov
test.)

2. If the data are from the same Ss or highly correlated samples (i.e., matched
groups), use a matched t-test. If the design is repeated-measures and you
can't meet the normal distribution and equal variance assumptions, use a
Wilcoxon Matched-pairs signed-ranks test or (last resort!) a Sign test.

3. If the data are not normally distributed (the X̄ is not a good measure of
central tendency), use a Rank sums or Median test for a between-groups
comparison or a Wilcoxon Matched-pairs signed-ranks for a repeated-measures
comparison.

4. If you run the t-test on the computer, you can check for equal variances in
the printout. If calculated by hand and the design is balanced, you can
assume they are equal. If the variances are not equal, use the nonparametric
tests listed above and compare the results. Or, add data to make a balanced
design.

5. To compare more than two groups, use a test which allows for such
comparisons (e.g., ANOVA, GLM, or Friedman).

6. Report eta² for strength of association.


ANOVA: Comparing More Than Two Means

1. The data are scores (or ordinal scale data appropriate for parametric
procedures).

2. The data are independent (between-groups designs).

3. The data in the samples are normally distributed so that X̄ is an appropriate
measure of central tendency.

4. The data in the respective populations are normally distributed with equal
variances.

5. The design is balanced (unless specific ANOVA procedure adjusts for
unbalanced nonorthogonal designs).

6. There are a minimum of 5 Ss or observations per cell.

7. The F statistic allows us only to (not) reject the null hypothesis (i.e., it shows
the groups differ but not where the difference lies).

8. The F statistic shows if an effect exists, not its strength.
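
Several of these points (balance, minimum cell size, and the separation of the
F ratio from strength of association) can be illustrated with a short Python
sketch. It computes a one-way between-groups F ratio by hand and the usual
omega² for a balanced one-way design, (SS between - (k - 1) MS within) divided
by (SS total + MS within). The scores and function names are invented for
illustration; a statistical package would normally do this work.

```python
def check_design(groups, min_n=5):
    """Report whether the design is balanced and meets the minimum cell size."""
    sizes = [len(g) for g in groups]
    return {"balanced": len(set(sizes)) == 1, "min_cell_ok": min(sizes) >= min_n}

def one_way_anova(groups):
    """Return (F, df_between, df_within, omega_squared) for independent groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    omega_sq = (ss_between - df_between * ms_within) / (
        ss_between + ss_within + ms_within
    )
    return ms_between / ms_within, df_between, df_within, omega_sq

# Hypothetical scores for three independent groups of five Ss each.
groups = [
    [12, 15, 11, 14, 13],
    [16, 18, 15, 17, 19],
    [14, 13, 16, 15, 12],
]
print(check_design(groups))
f_ratio, df_b, df_w, omega_sq = one_way_anova(groups)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}, omega squared = {omega_sq:.2f}")
```

A significant F only licenses the conclusion that the groups differ somewhere;
a multiple-range test is still needed to say which groups differ.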


Solutions when assumptions cannot be met.

1. If data are not interval, use a Kruskal-Wallis or Friedman test.

2. If the data are not independent, use a Repeated-measures ANOVA or GLM.
If it is a mixed design, use Factorial ANOVA or consult a statistician to
determine which member of the ANOVA family is appropriate for the data.

3. If data are not normally distributed, use a nonparametric test
(Kruskal-Wallis or Friedman).

4. Apply a nonparametric procedure (Kruskal-Wallis or Friedman) if the
distribution in the population cannot be assumed to be normal.

5. If the design is not balanced, use SAS procedure GLM, and consult the Type
III results.

6. Collect additional data to meet minimum cell size.

7. When the F statistic is statistically significant, use a multiple-range test (e.g.,
Scheffé, Tukey, Newman-Keuls) to determine which groups differ from each
other.

8. Report omega² as the strength of association.


Chi-square: Relating Categorical Variables

1. The data are frequencies. 

2. The data are independent. This means, first, that each observation or S
appears in only one cell. The comparisons are between-Ss or between-texts; i.e.,
no repeated-measures. Second, the data within each cell are independent.
Each S or observation contributes just one piece of information.

3. Correction factors are applied to tables with 1 df. 

4. Expected cell frequencies are > 10 for tables with 1 df. Expected cell fre¬ 
quencies are > 5 for all other tables. 

5. In the report of results, the observed χ² value allows us only to (not) reject the
null hypothesis (not to locate precisely where the difference is).

6. The results show a relationship among the variables but not the strength of 
the relationship. 


Solutions when assumptions cannot be met. 

1. If the data are continuous, use the appropriate correlation (Pearson or
Spearman) to show relationship. Or, if the data are categorical and can be
appropriately converted to raw frequencies, do so.

2. For nonindependent data between cells (i.e., repeated-measures), use
McNemar's test or CATMOD. For nonindependence of the data within cells
(i.e., individual Ss or texts contribute different frequencies to one cell),
convert the data as shown on page 407 of this manual or convert the data to rates
(X per 100 or whatever) and apply an appropriate nonparametric procedure.

3. For 1 df use Yates' Correction factor.

4. For small cell sizes, collapse cells or use a Fisher's Exact test (see Siegel,
1956).

5. To locate differences when the χ² is significant overall, interpret the larger
(O - E)²/E values, or do a Ryan's procedure to locate differences among
cells.

6. Use phi (φ) or Cramer's V for strength of association.
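
Several of these checks can be combined in one small routine. The Python sketch
below computes expected frequencies for a table of raw frequencies, applies
Yates' correction only when the table has 1 df, reports the smallest expected
cell frequency so that it can be compared with the minimums given above, and
computes Cramer's V (which equals phi for a 2 x 2 table). The frequencies are
invented, and basing V on the uncorrected χ² is our own choice for this sketch.

```python
from math import sqrt

def chi_square(table):
    """Chi-square for a table of frequencies (list of rows); Yates at 1 df."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    raw = corrected = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            min_expected = min(min_expected, expected)
            diff = abs(observed - expected)
            raw += diff ** 2 / expected
            # Yates' correction applies only to tables with 1 df.
            adjusted = max(diff - 0.5, 0.0) if df == 1 else diff
            corrected += adjusted ** 2 / expected
    smaller_dim = min(len(row_totals), len(col_totals)) - 1
    return {
        "chi2": corrected,
        "df": df,
        "min_expected": min_expected,
        "cramers_v": sqrt(raw / (n * smaller_dim)),
    }

# Hypothetical frequencies: two groups of Ss by two response categories.
result = chi_square([[20, 30], [30, 20]])
print(result)
```

The routine does not test independence of the observations; that has to be
settled from the design before the frequencies are entered.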


Pearson Correlation: Relating Interval Variables 

1. The two data sets are measured with scores or ordinal scales that are truly 
continuous. 

2. The two data sets have equivalent reliability. 

3. The data are independent (i.e., no uncorrected part-to-whole correlations).

4. The data on both measures are normally distributed. To obtain a normal
distribution, n size should be > 30 or 35 for each.

a. The range of scores on both variables is "large enough"; i.e., scores are not
grouped tightly together with little range.

b. There are no extreme scores at each end of the range with little in the
middle of the distribution.

5. The relationship is linear, not curvilinear.

6. Checking the probability of any correlation coefficient simply allows us to
reject the null hypothesis of no relationship. The probability does not reflect
the strength of the relationship.

7. Pearson correlations used for interrater reliability with more than two raters
will likely be distorted.


Solutions when assumptions cannot be met.

1. When data are not score data or when not normally distributed:

a. Use point biserial correlation when one variable is a dichotomous (as in
pass/fail) nominal variable and the other is score data.

b. Use Spearman or Kendall tau correlations for rank-order data. Be sure
to check number of ties when using Spearman if you do the procedure by
hand (computer programs correct for ties).

c. Use phi for two categorical variables or eta where one is a category and
the other score.

d. Use Kendall's Concordance when checking agreement in rank orders
across many cases (rather than just two).

2. If the tests being correlated are not equivalent in terms of reliability, correct
for attenuation.

3. Remove the contribution of the part from the whole before correlating part
and whole.

4. When data are not normally distributed, use Spearman or gather additional
representative data.

5. Consult a statistical consultant when the scatterplot shows a curvilinear
relationship. There are special methods for curvilinear data.

6. Interpret the strength of correlation with r² (rather than the probability level).
Remember that correlations of tests are to some extent dependent on the
reliability of each test. This is not reflected in a correlation matrix which has
not been corrected for attenuation.

7. Use a Fisher Z transformation to correct distortion in interrater reliability
(when there are more than two raters).
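
Three of these adjustments are simple enough to sketch in Python: the standard
correction for attenuation (the observed r divided by the square root of the
product of the two reliabilities), r² as the amount of overlap, and averaging
several interrater correlations through the Fisher Z transformation rather
than averaging the raw coefficients. All values below are invented for
illustration, and the function names are our own.

```python
from math import atanh, sqrt, tanh

def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimate the correlation if both measures were perfectly reliable."""
    return r_xy / sqrt(reliability_x * reliability_y)

def overlap(r):
    """Proportion of variance shared by two measures (r squared)."""
    return r ** 2

def average_r_fisher(correlations):
    """Average correlations by way of the Fisher Z transformation."""
    z_values = [atanh(r) for r in correlations]   # r to Z
    return tanh(sum(z_values) / len(z_values))    # mean Z back to r

# Hypothetical values: an observed r of .60 between two tests with
# reliabilities of .81 and .64, and correlations among three raters.
corrected = correct_for_attenuation(0.60, 0.81, 0.64)
print(f"corrected r = {corrected:.2f}, overlap = {overlap(corrected):.2f}")

rater_pairs = [0.70, 0.80, 0.90]
print(f"average interrater r = {average_r_fisher(rater_pairs):.2f}")
```

Averaging through Z gives more weight to the higher coefficients than a simple
arithmetic mean does, because the sampling distribution of r becomes more
skewed as r approaches 1.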


Regression: Predicting Y from X 

1. Since regression is related to Pearson correlations, the assumptions of the 
Pearson must be met. 

2. The correlations must be accurate (corrected for attenuation). 

3. Check that there is no multicollinearity.

4. A minimum of 30 to 35 Ss or observations per independent variable is
required. As a general rule, regression should not be done with n sizes below
200.


Solutions when assumptions cannot be met. 

1. If data are ordinal and a Spearman correlation was performed, talk with a 
statistical consultant to determine whether the data are appropriate for re¬ 
gression. 

2. Correct for different test reliabilities (correction for attenuation). 

3. Collapse or omit variables which overlap above .80. Or consult a statistician 
for assistance. 

4. Don't perform regression with small n sizes. Collect additional data instead. 
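
The link between the strength of the correlation and the quality of the
prediction can be seen in a short Python sketch of simple regression. The slope
is taken as r times the ratio of the standard deviations, and the standard error
of estimate (SEE) as the standard deviation of Y times the square root of
1 - r². The data are invented and far too few for a real regression; they serve
only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev

def simple_regression(x, y):
    """Return (slope, intercept, r, see) for predicting y from x."""
    mean_x, mean_y = mean(x), mean(y)
    sd_x, sd_y = stdev(x), stdev(y)              # sample SDs (n - 1)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = cross / ((len(x) - 1) * sd_x * sd_y)
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    # Guard against tiny negative values caused by floating-point rounding.
    see = sd_y * sqrt(max(0.0, 1 - r ** 2))
    return slope, intercept, r, see

def predict(x_value, slope, intercept):
    """Predicted Y for a given X."""
    return intercept + slope * x_value

# Hypothetical placement scores (X) and course grades (Y) for six Ss.
x = [45, 50, 55, 60, 65, 70]
y = [62, 70, 68, 75, 80, 79]

slope, intercept, r, see = simple_regression(x, y)
print(f"r = {r:.2f}, SEE = {see:.2f}")
print(f"predicted Y for X = 58: {predict(58, slope, intercept):.1f}")
```

As r approaches 1 the SEE shrinks toward zero, which is why a stronger
correlation and a smaller SEE go together with better prediction.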


PCA, Factor Analysis, and Multidimensional Scaling

1. Since these procedures build on Pearson correlations or, in the case of MDS,
on Spearman, the data must meet the assumptions associated with these
correlations.

a. For principal component and factor analysis, check all the assumptions 
of Pearson's correlation. 

b. For multidimensional scaling, check either the assumptions of Pearson or 
Spearman, depending on which is used as input to the MDS procedure. 

2. Normal distribution is best assured by having approximately 35 Ss or
observations per variable. These are not small n-size procedures.

3. Cutoff points and interpretation of factors or dimensions (or clusters) must
be guided by substantive arguments as well as statistical evidence.

Solutions when assumptions cannot be met.

1. If data are nominal, use a categorical (loglinear) modeling procedure. 

2. If the n is too small, add more data. If this is impossible, interpret the study 
as exploratory in nature and warn readers that even this exploration must be 
considered tentative. 

3. When interpretation is difficult, discuss the issues with your advisor,
statistical consultant, or a member of the testing or research special interest
groups of TESOL.


Path Analysis 

1. Path analysis has the same assumptions as regression. Review each of these 
carefully. 

2. The variables are linear, additive, and causal (i.e., no curvilinear relations,
variables are independent, and logic or previous research/theory argues that 
there is a causal direction to the relationship). 

3. All relevant variables are included in the system. 

4. There is a one-way causal flow in the path diagram that is being tested. 

5. And, as with regression, the data are normally distributed and the variances 
are equal. Again this is not a small sample technique. To approximate 
normal distribution with equal variances, be sure the sample size is as large 
as possible. 

6. Before causal claims can be made (beyond the sample), be certain that the 
data are randomly selected and are representative of the population, and that
all threats to internal and external validity have been met. 

Solutions if the assumptions cannot be met. 

1. If the data are frequency distributions rather than interval data, use a 
loglinear path analysis. 

2. If the causal flow is not one-way (i.e., the model is nonrecursive), consult a
statistician for help.

3. If the sample size is insufficient and the analysis is performed as an explora¬ 
tory test, do not make causal claims. Caution the reader that this is a pilot 
study and attempt to collect additional data. 

4. If there are design faults that threaten external validity, caution the reader
that causal links apply only to the data sample.


Categorical Modeling

1. The data are for nominal, categorical variables (with two or more levels).

2. The data are representative and randomly selected. 

3. The particular procedure matches the design. 

a. Between-groups designs: loglinear or CATMOD

b. Path analysis: loglinear

c. Repeated-measures designs: CATMOD 

4. When using loglinear to test effects, interpret interaction before main effects 
(as with ANOVA). When using the procedures to test competing models, 
interpret the differences in models based on both substantive and statistical 
arguments. 

Solutions when the assumptions cannot be met. 

1. If data are interval, use PCA, factor analysis, or multidimensional scaling. 

2. When interpretation is difficult, consult your advisor, statistical consultant,
or a member of the testing or research special interest groups of TESOL.


Conclusion


Applied linguistics, at our university, is part of the division of humanities.
Humanities, by its nature, includes all the fields that help to define what it
means to be human. For most people, music, art, and dance are important, but it
is language and language use that is central to that definition. Whether the
language is that of child or of poet, it helps to define what we are.

Within humanities there are many fields, and each looks at language in a 
different way. Linguists see their research agenda, in part, as describing 
languages in a way that will reveal something of the language faculty of 
humankind. The sociologist looks at how society uses language to carry out oral 
and written communication and how that communication is shaped by social groups. 
Language use is also a concern of anthropologists who document how groups and 
individuals within groups use language to carry out their roles and organize 
their worlds. Developmental and cognitive psychologists have as one of their 
central concerns the connections between language and cognitive and social 
development. For many psychologists, the gift of literacy is as puzzling as 
language acquisition itself. The cognitive processes underlying reading and 
composing come to define humanness and creativity in parallel to that of the 
spoken language. The psychologist, too, is intrigued by the way we "turn 
words"--for example, the use of everyday metaphor--to form and transform our 
realities. 

For a very large portion of the world, bilingualism and even multilingualism is 
the expected state of affairs. If language--and the arts--together with cognitive 
and social knowledge are defining qualities of humans, then these qualities 
expand in all directions as new languages, new social knowledge, and new ways of 
organizing cognitive concepts are added. 

Our research agenda in applied linguistics overlaps with linguistics in that we 
too are interested in descriptions of language that brighten our understanding 
of language acquisition. We share the interest of sociologists and 
anthropologists in understanding the organization of talk and written 
communication in social groups. Those of us working on child bilingualism and 
second language acquisition share the same interests as those of developmental 
and cognitive psychologists. We want to understand not only the patterns of 
early second language development but also the creative use of language by 
bilingual and multilingual writers. Here our interests merge with those of 
literature. 

Since the interests of all these fields overlap, what is it that makes ours 
different? Certainly the major difference must be our interest in teaching 
languages. As we learn about language acquisition outside the classroom, we 
wonder about language acquisition within the classroom. How do they differ? 
What can we do to promote the process? As we learn about language use within 
social groups (whether these groups are in academic disciplines, in immigrant 
neighborhoods, in the factory or workplace), we wonder how we can transform the 
classroom to teach communicative competence and appropriate register variation. 
When we compare the rhetorical organization of genres across languages, we think 
about the comparison in terms of better ways to teach composition and reading in 
our classes. We can begin to understand the learning preferences of our students 
if, with the anthropologist and sociologist, we investigate how language is used 
to organize the world of the classroom. However we define applied linguistics, 
one part of our research agenda must be the question of how best to help 
learners acquire the social-cultural knowledge of a new ethnic group and the 
language that puts that knowledge to use. 

We have said that questions are "interesting" in our field to the extent that 
they inform theory (e.g., theory of acquisition, theory of reading, theory of 
language use) or practice (e.g., language teaching, language planning and policy, 
materials development, language testing, and so forth). 

The problem is that it would be foolish to think that we can address theory 
without practice or practice without theory. One affects the other. We have, 
indeed, urged you to simplify your research efforts. If we think of the 
complexity of such questions as, What is it that makes us human? you can 
understand why many scholars want to look at language as separate from social 
knowledge, separate from other types of cognitive knowledge, separate from 
language use, separate from language instruction, and separate from the issues 
of bilingualism or multilingualism. You can understand why many educators want 
to look at bilingualism or multilingualism as only a cognitive issue or only a 
social issue. It's easy to understand why we, as teachers, want to see the 
question in terms of better teaching materials, new ways of organizing tasks, 
and ways of finding activities that focus on "just" language or "just" cognitive 
problems or "just" social organization. We all know this can't be done, but we 
struggle to do it because the task is so overwhelming otherwise. 

On the other hand, life, for most people, is an attempt to answer the most 
complex question of all, what it means to be human. It is human nature to be 
curious about ourselves. So, since we are constantly asking these questions 
anyway, we are fortunate to be applied linguists. We get paid for looking for 
our own answers! The search for answers on the personal level, however, is 
usually quite different from that at the professional level. One difference is 
that as professionals we accept a framework for the search. There are, of 
course, many different frameworks to choose from within our field. Whichever we 
select, it is the framework which gives us confidence in sharing the answers we 
find. 

We have said that research is the organized, systematic search for answers to 
the questions we ask. Because it is systematic, we have greater confidence in 
our findings. When we have thought about all the possible threats to internal 
and external validity and adjusted our search to deal with as many of these as 
possible, we have greater confidence in our findings. Statistical procedures are 
useful tools to increase that confidence. If we use them appropriately, we can 
feel confident that our descriptions and results are correct. We know when we 
can and cannot generalize our answers beyond the people, the place, and the time 
frame in which we searched for answers. 

In our personal lives we each probably have a Grandmother Borgese--that 
mythical someone who continually says "That I knew already," "I don't believe 
it," and "Who cares anyway?" as we struggle with questions which are important 
to us. By organizing our search, we will be better equipped to answer back. 

A year after visiting the school where we hoped to involve teachers in research, 
we received a group of papers from some of the students at the school. Not only 
did these teachers have questions, but they had stimulated the curiosity of their 
students as well. For some of these students, problem solving became a solitary 
research effort: 

I thought about it last night in my dream I desided what to do. I ask 
10 4th grades in Mrs. A class and I wright it down. I ask 10 people 
at my house and I wright it down. I ask 9 teashers and 1 lady in the 
offis (= the principal). I writ it and I count it and I can tell the 
answer. I will draw a piechart for the answer. 

For others it was a group effort: 

We worked hard for our book rerort ( =book report). Their was too 
many circles. What we are going to do. First we are going to Jose 
is going to count the red words. Then we are going to Cindy write 
the red words. Nix we are going to we study red words in spelling. 

Then we are going to Carlos read the words. Then we are going to 
see the words. We put stickers for good. Then we are going to count 
the words and see. Then we are going to write the book rerort for 
you. 

Systematicity in answering questions is alive and well in the world! At times, we 
may feel that the questions (how to get rid of "red words") are trivial in view of 
our larger questions. But each fits a place in our lives, in our theories, and in our 
understanding at a particular time. 

At this particular time, we expect that you are better prepared to evaluate the 
evidence presented in applied linguistics research. We also believe that you are 
better able to undertake research yourself. We hope that you now keep a research 
journal, and that you have refined your broad research interests into feasible 
questions. You should be able to design a project with minimal threats to 
internal--and possibly even external--validity, a design that allows you to 
search for answers in appropriate ways. Finally, we trust that you can gather 
reliable data, analyze them appropriately, and interpret the findings wisely. If 
you can do all these things, you have more than met the objectives of this 
course. At the next TESOL, AILA, AAAL, or SLRF conference, you should feel 
confident in saying, "I have an interesting question, I know one way to answer 
it, and I am willing to share that answer with you." The researcher who can say 
this need never again worry about Grandmother Borgese. 
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Appendix A 
Pretest 


This pretest is a diagnostic to help you discover which sections of the Manual 
contain material you should review. The first section asks you to identify 
statistical procedures appropriate to a given research question. The second 
section is a "vocabulary test" meant to check your familiarity with basic terms 
used in data-based research. 

Name the statistical procedure(s) needed to analyze the data described in each 
of the following items. 

1. You borrowed data from 30 SLA researchers who had done longitudinal research 
on 1 S each. You want to know if there is any order in which these ESL 
learners acquire Wh-questions (what, where, who, when, how, why, etc.). 
(Chapter 7) 


2. As a second part of the study, you want to check to see whether the order you 
found (in item 1 above) compares to data in a cross-sectional study. You add a 
caboose (a short extra subtest) of Wh-questions to the placement exam at your 
school. (Chapter 15) 


3. Compositions of 50 entering ESL freshmen with SAT verbal scores between 
250 and 300 were read and Ss were referred either to Freshman ESL or regular 
English writing programs. Using strict scoring guidelines, three teachers rate each 
composition for a possible total of 25 points. You need to report the reliability 
of scoring. (Chapter 18) 


4. At the end of a quarter, the Ss in item 3 above (30 were in ESL; 20 in writing 
programs) again wrote a final composition (assume topics were equivalent). 
These were scored the same way. You want to know if both/either program(s) 
produced results (and/or if one worked better). (Chapters 9 and 10) 


5. Assume that the difference between treatments in item 4 wasn't significant 
because there was so much variability within each group. Still, it appeared some 
Ss did improve more than others irrespective of treatment. Therefore, you 
decided to forget about treatment altogether and instead determine whether L1 
membership would account for this. You group the ESL Ss into five major L1 
types. (Chapter 11) 


6. You decide major might also be important so you divide them into 
science/humanities/undecided and add that to L1 (in item 5). (Chapter 13) 


7. Grades (A to D) were given to all Ss in advanced ESL classes. You wondered 
if attendance at tutorials would be reflected by higher grades for those who 
participated. You obtain a list of Ss who attended the tutorials so that you have 
information in the form of "never--occasionally--regularly--very often" for each 
S. (Chapter 14) 


8. You have data from 30 child ESL learners who mix/switch languages while 
speaking with a bilingual researcher. You have classified the switching as 
lexical/phrase level/sentence level switches. For each S, you convert the 
frequencies to proportions: % lexical, % phrase level, and % sentence level 
switches. You believe children primarily use lexical switches. (Chapter 12) 


9. You believe the relationship between need achievement and foreign language 
proficiency should be obvious to all. It isn't, though, so you collect data to try 
to demonstrate this. You also collect data from each S using the Eysenck scale 
of personality factors. There are 32 personality traits that you hope to reduce to 
two factors--introvert/extrovert and stable/unstable. You also have data for each 
S on need achievement and field independence. You want to know how these 
variables relate to language proficiency and which (or what combination) best 
predicts the "good language learner." The data are from 450 ESL Ss entering 
your university. (Chapter 16/17) 


10. In preparing a "culture course" for a program in China, you ask Americans 
in the United States and Chinese Ss in China to complete a values survey. In one 
part of the survey Ss rank-order the importance of 20 values statements. You 
want to know whether there is agreement within each group on the importance 
of shared values and whether the importance rating for each value is similar for 
the two groups. (Chapter 15) 


Following is a list of statistical concepts. Place a check mark beside those you 
do not know. 

Chapter 1 

Null hypothesis 
Alternative hypothesis 
Internal validity 
External validity 
Interaction 
Operational definition 
Random sampling 


Chapter 2 

Independent variable 
Dependent variable 
Interval measurement 
Ordinal measurement 
Nominal measurement 

Chapter 3 

Control group 
Intact group 

Repeated-measures design 
Between-groups design 

Chapter 5 

Percent 

Proportion 

Ratio 

Rate 

Frequency 

Chapter 6 

Range 

Variance 

Standard deviation 

Mode 

Median 

Mean 

Normal distribution 
Bimodal distribution 
z score 

Chapter 7 

Guttman/Implicational scaling 
Percentile 

Chapter 8 

Probability 

Power 

Nonparametric procedure 

Chapter 9 

t-test 
Degrees of freedom (df) 
Eta squared 
Mann Whitney U/Wilcoxon Rank sums test 
Median test 

Chapter 10 

Matched t-test 
Sign test 
Wilcoxon Matched-pairs signed-ranks test 

Chapter 11 

Analysis of variance (ANOVA) 
Omega squared 
Post hoc comparisons 
Kruskal-Wallis One-way analysis of variance test 

Chapter 12 

Friedman test 

Chapter 13 

Factorial ANOVA 
Interaction 

Chapter 14 

Chi-square test 
McNemar's test 

Chapter 15 

Pearson correlation 
Correction for attenuation 
Spearman rho Rank-order correlation 
Point biserial correlation 
Kendall's Coefficient of concordance 

Chapter 16 

Linear regression 
Slope 
Multiple regression 

Chapter 17 

Factor analysis 
Path analysis 
Loglinear analysis 
Multidimensional scaling 
CATMOD 

Chapter 18 

Reliability 
Validity 
KR-20, KR-21 
Prophecy formula 
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Appendix B 

Answer Key to Exercises 


Chapter 1 

1.5.1 

H0: There is no relation between language proficiency and spelling scores. 
Interpretation: The spelling scores will not vary in relation to the language 
proficiency of the Ss. 

Alternative hypothesis: There is a relation between language proficiency and 
spelling scores. 

Interpretation: The spelling scores will vary in relation to the language 
proficiency of the Ss. 

Directional, positive hypothesis: There is a positive relation between language 
proficiency and spelling scores. 

Directional, negative hypothesis: There is a negative relation between language 
proficiency and spelling scores. 

Interpretation: Positive: Students at the higher levels of language proficiency will 
obtain higher spelling scores; students at lower proficiency levels will obtain lower 
spelling scores. Negative: Students with lower levels of language proficiency will 
score higher on the spelling test while those with higher levels of language 
proficiency will score lower. 

1.6.1 

H0 for sex: There is no effect of sex on spelling scores. (If this hypothesis were 
rejected, males and females would show different spelling scores.) 

H0 for sex and L1: There is no interaction of sex and first language membership 
on spelling scores. (If this hypothesis were rejected, there might be a pattern 
where for females the scores wouldn't change across the L1 groups while for men 
they would. That is, sex and L1 would interact--an interaction one would not 
hope to get.) 

H0 for sex and language proficiency: There is no interaction of sex and language 
proficiency on spelling scores. (If this hypothesis were rejected, there might be a 
pattern where for females the scores wouldn't change across the proficiency levels 
while for men they would. This is another interaction one would not hope to get.) 

H0 for sex, L1, and language proficiency: There is no interaction of sex, language 
background, and language proficiency on spelling scores. The researcher hopes 
this interaction will not be significant. If the null hypothesis is rejected, it might 
mean, for example, that females of certain first language groups who are at 
certain proficiency levels perform differently than others. 
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Chapter 2 
2.5.2 

Nominal variable, level, ordinal, interval 

2.6.1 

a. Variables: 1. Amount of information translated in information units. 
Measurement: frequency; or interval scores if points are set ahead of time for each 
piece of information so that an S would receive a total score based on number of 
correct points supplied. 2. Condition: ±pause (2 levels). Measurement: nominal. 

b. Variables: 1. Technical vocabulary problems. Measurement: interval data 
(converted from frequencies to percentages which may be interval in nature). 
Change: yes, the data would be measured as frequencies. 2. Text field (3 levels). 
Measurement: 3-level nominal category. 

c. Variables: 1. Ratings. Measurement: ratings are measured as ordinal scales. 
2. Status (2 levels--MA/Ph.D.). Measurement: nominal (1 or 2). 

d. Variables: 1. Cloze test score. Measurement: interval scores. 2. Type of 
cohesive tie (4 levels). Measurement: nominal variable with 4 categories. 3. L1 
membership. Measurement: nominal with 7 categories. 

e. Variables: 1. Number of misunderstandings. Measurement: frequencies. 2. 
Status (2 levels--NS and NNS). (The description is not very clear; it's possible 
that the study may have looked at dyads so that instead of student status you 
might have dyad type--NS/NS, NNS/NNS, NS/NNS--as the status variable.) 
Measurement: nominal. 

f. Variables: as above plus problem type (2 levels--easy/difficult), nominal with 
two category levels. 

2.9.1 

a. Dependent: frequency of 5 types of uncertainty expressions. Independent: 
status (teacher/student). Control: all Ss are university persons. Possible other 
intervening variables not measured: sex of teacher/student, age of 
teacher/student, field of study/expertise of teacher/student. (This assumes that 
each of the six types of uncertainty expressions is a separate dependent variable. 
If comparisons are to be made among these six types, then the dependent variable 
would be frequency of uncertainty expressions and another independent variable 
would be type of uncertainty expression with 6 levels.) 

b. Dependent: number of information units recalled or information score (if 
points assigned and a total score computed). Independent: condition (±pause). 
Control: passage. Possible intervening variables: length of pause, proficiency 
of translators in each language, age, etc. 

c. Dependent: % technical words in problem list. Independent: text field. 
Control: all university persons; all Chinese L1. Other possible intervening 
variables that were not measured: English proficiency level of Ss, years studying 
major. 

d. Dependent: ratings. Independent: student status. Control: all university 
persons; all in same major area of study. Other possible intervening variables 
that were not measured: sex, age, professional background of Ss. 

e. Dependent: score on cloze passage. Independent: cohesive tie type. 
Independent (or moderator): L1 membership. Other intervening variables: ESL 
proficiency level, sex, age, years of study, etc. 

f. Dependent: number of misunderstandings. Independent: status (NS or 
NNS). Intervening: proficiency level of Ss, matched sex or cross-sex dyads, years 
in the United States for NNSs, etc. 


Chapter 3 
3.1.1 

a. Between-groups. The comparison will be drawn between the responses of 
faculty who have had such students and faculty who have not. These are two 
different groups of faculty. 

b. Between-groups and repeated-measures. Between-groups since there are 
comparisons to be drawn across four different groups of students. Repeated- 
measures since the ratings given by each S to the first episode will be compared 
with the same S 's ratings of the second, and so forth across the five episodes. 

c. Between-groups. The comparison of position of the purpose clauses will be 
between oral and written corpora. The data come from two different sources. 

d. Repeated-measures. Each S's scores on pronunciation, conversation, and oral 
presentation measures are compared with his or her own scores at two-week in¬ 
tervals. Also repeated-measures: A course report with average scores at each 
point would be reported. 

e. Repeated-measures and between-groups. Each S's score in the formal context 
is compared with the same S's score in the informal context. Then, the scores of 
the nonnative speakers are compared with those of the native speakers—scores of 
two different groups are compared. 

3.2.1a. 

                      Faculty 
                + ESL Ss    No ESL Ss 
          Yes 
Credit 
          No 


                        L1 Background 
                NS    Canad.    Ind.    Vietn. 
Total Rating 


                   Purpose Clauses 
                Oral      Written 
Precede 
Follow 
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Chapter 5 


Connector      f    cum F 

to            89     490 
and           72     401 
prep          52     329 
because       50     277 
when          50     227 
that          42     177 
if            28     135 
but           15     107 
where         13      92 
okay          12      79 
or            12      67 
like          11      55 
how           11      44 
what          10      33 
who            9      23 
well           5      14 
after          4       9 
so             3       5 
as             1       2 
why            1       1 


Genre             f    cum F 

account         476    1153 
event cast      348     677 
operation       121     329 
label quest     112     208 
recount          22      96 
student quest    18      74 
exposition       17      56 
hypothesis       16      39 
argument         12      23 
meaning quest    11      11 


Written Mode 

Clause           %     Cum % 

S.Finite      54.8%    100% 
Complex       32.2%    45.2% 
S.Nonfin       7.0%    13.0% 
Coord.         3.4%     6.0% 
Frag           1.5%     2.6% 
Comp/Coord     0.8%     1.1% 
NMSub          0.3%     0.3% 
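
The cum F column in the tables above is a running total built from the bottom row upward. A minimal Python sketch of that bookkeeping follows; it uses only the six least frequent connectors from the first table, and the variable names are illustrative rather than taken from the Manual.

```python
from itertools import accumulate

# The six least frequent connectors, listed from the bottom row of the table upward.
connectors = ["why", "as", "so", "after", "well", "who"]
freqs = [1, 1, 3, 4, 5, 9]

# Each cumulative frequency is the row's own f plus every f below it in the table.
cum_f = list(accumulate(freqs))

for name, f, cf in zip(connectors, freqs, cum_f):
    print(f"{name:<8}{f:>4}{cf:>6}")
```

Dividing each cumulative frequency by the grand total and multiplying by 100 would give a Cum % column like the one in the clause table.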


5.7.1 

[Bar graph not reproduced; vertical axis from 0 to 1000 in steps of 200; 
surviving category label: Female] 
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Chapter 6 

6.1.3 Mode = 160, Median = 160, Mean = 179.3 
6.2.1 

[Graph not reproduced; legend: time 1, time 2] 
6.4.1 

Mid-term Exam 

Scores    (X - mean)     x^2 

  16         0.6          .36 
  13        -2.4         5.76 
  13        -2.4         5.76 
  19         3.6        12.96 
  18         2.6         6.76 
  15        -0.4          .16 
  20         4.6        21.16 
  11        -4.4        19.36 
  14        -1.4         1.96 
  15        -0.4          .16 


6.4.3 

N - 1 = 9 

6.5.2 

$s = \sqrt{8.27}$ 
$s = 2.87$ 

raw score formula: 

$s = \sqrt{\dfrac{2446 - (23716 \div 10)}{10 - 1}}$ 

$s = \sqrt{8.27}$ 
$s = 2.87$ 
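
As a check on 6.4.1 through 6.5.2, the following Python sketch computes the standard deviation of the ten mid-term scores both ways: from the squared deviations and from the raw score formula. The variable names are illustrative assumptions; the scores are those listed in the table above.

```python
import math

scores = [16, 13, 13, 19, 18, 15, 20, 11, 14, 15]
n = len(scores)
mean = sum(scores) / n  # 15.4

# Deviation method: sum the squared distances from the mean.
ss_deviation = sum((x - mean) ** 2 for x in scores)

# Raw score method: sum of X squared minus (sum of X) squared over N.
sum_x = sum(scores)                  # 154, so (sum of X)^2 is 23716
sum_x2 = sum(x * x for x in scores)  # 2446
ss_raw = sum_x2 - sum_x ** 2 / n

variance = ss_raw / (n - 1)          # about 8.27
sd = math.sqrt(variance)             # about 2.875; the key reports 2.87

print(round(variance, 2), round(sd, 3))
```

Both methods give the same sum of squares (74.4), which is why the two routes in 6.5.2 end at the same value of s.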


Chapter 7 

7.1.1 

a. Sec 1: 54 = 62nd percentile, Sec 2: 62 = 90th percentile, Sec 3: 42 = 10th 
percentile. 

b. 520 

c. .02 x 759,768 = 15,195 

d. .41 x 759,768 = 311,505 
.59 x 759,768 = 448,263 

7.2.1 

Score of 20 = 20th percentile 

7.2.2 

Score of 20 = within the 1st quartile 
Score of 20 = at the 2nd decile 
7.4.1 

Raw Score       z 

   38           0 
   39         .17 
   30       -1.33 
   50        2.00 
   25       -2.17 

Raw Score       T 

   38          50 
   39        51.7 
   30        36.7 
   50          70 
   25        28.3 


7.6.1 

         SIntro  RIntro  PreC1  PreC2  Close  Greet 

            1       1      1      1      1      1 
            1       1      1      1      1      1 
            1       1      1      1      1      1 
            1       0      1      1      1      1 
            0       0      1      1      1      1 
            0       1      0      1      1      1 
            0       0      1      1      1      1 
            0       0      1      1      1      1 
            0       0      1      1      1      1 
            0       1      0      1      1      1 
            0       0      1      1      1      1 
            1       0      1      1      0      1 
            0       0      0      1      1      1 
            0       1      0      0      1      1 
            0       0      1      1      1      0 
            0       1      0      0      1      1 
            0       0      1      0      1      1 
            0       0      0      0      1      1 
            0       0      1      0      0      1 

Tot 0s     14      12      6      5      2      1 
Tot 1s      5       7     13     14     17     18 

2 correct = 2, 3 correct = 5, 4 correct = 8, 5 correct = 1, 6 correct = 3. 


7.6.2 

Errors are listed in the 'total' column above. Total errors = 18. Ss got item right 
when we would predict they would get it wrong. 

7.7.1 

a. 

$C_{rep} = 1 - \dfrac{18}{(19)(6)} = .84$ 

b. 

$MM_{rep} = \dfrac{88}{114} = .77$ 

c. 

% improvement = .84 - .77 = .07 

d. 

$C_{scal} = \dfrac{.07}{1 - .77} = .30$ 
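
The four values in 7.7.1 can be recomputed from the counts given in 7.6.1 and 7.6.2. The Python sketch below is a hedged illustration: the variable names are assumptions, and the larger marginal for each item is read from the totals row of the matrix (14, 12, 13, 14, 17, 18, summing to the 88 in the key).

```python
n_subjects = 19
n_items = 6
errors = 18                               # total errors reported in 7.6.2
max_marginals = [14, 12, 13, 14, 17, 18]  # larger of the 0/1 counts for each item

cells = n_subjects * n_items              # 114 responses in all

# a. coefficient of reproducibility
c_rep = 1 - errors / cells
# b. minimum marginal reproducibility
mm_rep = sum(max_marginals) / cells
# c. percent improvement in reproducibility
improvement = c_rep - mm_rep
# d. coefficient of scalability
c_scal = improvement / (1 - mm_rep)

print(round(c_rep, 2), round(mm_rep, 2), round(improvement, 2), round(c_scal, 2))
```

Carried at full precision, the last value comes out near .31; the key's .30 follows from dividing the rounded figures, .07 by .23.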


Chapter 8 

8.1.1 

a. 50%, 16%, 16% (rounded from 15.87) 

b. 2% (rounded from 2.28) 
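
The unrounded figures quoted in 8.1.1 (15.87 and 2.28) match the areas of the normal curve lying beyond 1 and 2 standard deviations from the mean. The exercise wording is not reproduced in this key, so that reading is an assumption; under it, the areas can be sketched with Python's standard library as follows.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of the distribution lying above a given z score.
above_mean = 1 - standard_normal.cdf(0)   # half the curve
beyond_1sd = 1 - standard_normal.cdf(1)   # about .1587
beyond_2sd = 1 - standard_normal.cdf(2)   # about .0228

for label, area in [("z > 0", above_mean), ("z > 1", beyond_1sd), ("z > 2", beyond_2sd)]:
    print(f"{label}: {area * 100:.2f}%")
```

Because the curve is symmetric, the area below z = -1 equals the area above z = +1, which would account for the repeated 16% in part a.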

8.2.1 

.13, .02 (rounded) 

8.2.2 

.10, .21, .001 (rounded) 

8.2.3 

.80, .98, .93 (rounded) 

8.2.4 

0, 1.88, 1.96 

8.5.3 

a. 

Measurement: frequencies; Type of comparison: repeated-measures; 
Representativeness of sample: no information on how 25 Ss were selected; 
Normal distribution: questionable, so check distribution; Wish to generalize: 
can't; Independence of measures: no, repeated-measures; Choice: nonparametric. 
Rationale: data are frequencies (not interval data), normal distribution is also 
questionable. 

b. 

Measurement: ordinal; Type of comparison: between-groups; Representativeness 
of sample: intact class is not randomly selected and it is unknown if the Ss in the 
class were randomly assigned to the two treatments. Normal distribution: very 
questionable with small n size so check. Wish to generalize: can't. Independence 
of measures: ratings may not be entirely independent, but groups are 
independent, we hope. Choice: nonparametric. Rationale: data are rank-ordered 
rather than interval measures. Normal distribution is questionable. 

c. 

Measurement: scores. Type of comparison: between-groups. Representativeness 
of sample: samples include all members of the population. Normal distribution 
expected: yes, given the n size. Wish to generalize: perhaps to similar new classes 
at same site with same materials, etc. Independence of measures: yes, if scores 
for each area are independently given. Choice: either parametric or 
nonparametric. Rationale for parametric choice: a normal distribution will 
obtain from the large n size; while the scores are derived from scale ratings, the 
ratings show a good distribution and are highly reliable; we believe the data are 
continuous. Since the mean and s are appropriate measures of central tendency 
and dispersion and we don't want to lose any of the information in the data, we 
want to select a parametric test. Rationale for nonparametric choice: the ratings 
are not equal-interval and the data are not normally distributed. Therefore, the 
mean and s are not appropriate measures. The data have continuity to the extent 
of ranks. Therefore, we will select a nonparametric test. 

d. 

Measurement: rank-order scales. Type of comparison: repeated-measures. 
Representativeness of sample: intact class so not representative. Normal 
distribution: questionable given small n size in a single class. Wish to 
generalize: can't. Independence of measures: no, since repeated-measures design. 
Choice: nonparametric. Rationale: rank-order scale data, small n size and 
nonrepresentative sample. 

e. 

Measurement: scores. Type of comparison: between matched pairs. 
Representativeness of sample: questionable since pairs had to be matched. 
Normal distribution: questionable so check. Wish to generalize: perhaps to the 
current population since pairs were chosen to represent the range of scores, but 
risky to generalize. Independence of measures: yes. Choice: both parametric and 
nonparametric procedures were run. They gave the same results and the 
nonparametric procedures were used in the final report. 


Chapter 9 

9.1.1 


14 df = 1.76 
22 df = 2.51 
55 df = 1.67 
10 df = 2.23 
n df = 2.77 

9.1.3 



$t_{obs} = \dfrac{80 - 85}{5 \div \sqrt{14}}$ 

$t_{obs} = -3.73$ 

$t_{crit} = 2.16$, df = 13, reject $H_0$ 


9.2.1 


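$s_{(\bar{X}_e - \bar{X}_c)} = \sqrt{\dfrac{8.4^2}{20} + \dfrac{7.7^2}{19}}$ 

$s_{(\bar{X}_e - \bar{X}_c)} = 2.58$ 

$t_{obs} = \dfrac{63.4 - 66.9}{2.58}$ 

$t_{obs} = -1.36$ 

$t_{crit} = 2.03$, df = 37, cannot reject $H_0$ 

Conclusion: there is no significant difference in the total test scores for the two groups. 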
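
The arithmetic in 9.2.1 can be retraced with a short Python sketch. The group labels (e and c) follow the subscripts in the key; the variable names are illustrative, and the sketch assumes the two-sample form in which each variance is divided by its own group size.

```python
import math

# Means, standard deviations, and group sizes as given in 9.2.1.
mean_e, sd_e, n_e = 63.4, 8.4, 20
mean_c, sd_c, n_c = 66.9, 7.7, 19

# Standard error of the difference between the two means.
se_diff = math.sqrt(sd_e ** 2 / n_e + sd_c ** 2 / n_c)

# Observed t and its degrees of freedom.
t_obs = (mean_e - mean_c) / se_diff
df = n_e + n_c - 2

print(round(se_diff, 2), round(t_obs, 2), df)
```

Since the absolute value of the observed t is smaller than the critical value of 2.03 for 37 df, the null hypothesis stands, as the key concludes.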

9.3.1 (German vocabulary example) 


$\eta^2 = \dfrac{3.0^2}{3.0^2 + 35}$ 

$\eta^2 = .20$ 

In these data, 20% of the variability in vocabulary test scores can be accounted 
for by group. 


9.3.2 (Keillor problem) 


$\eta^2 = \dfrac{3.73^2}{3.73^2 + 13}$ 

$\eta^2 = .52$ 

In this sample, 52% of the variability in scores can be accounted for by 
group. 
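
Both strength-of-association answers (9.3.1 and 9.3.2) follow the same pattern: the squared t value over the squared t value plus the degrees of freedom. A minimal Python sketch, with a helper name chosen here for illustration, is:

```python
def eta_squared(t_obs: float, df: int) -> float:
    """Proportion of variance accounted for, computed from a t value and its df."""
    return t_obs ** 2 / (t_obs ** 2 + df)

eta_german = eta_squared(3.0, 35)    # German vocabulary example
eta_keillor = eta_squared(3.73, 13)  # Keillor problem

print(round(eta_german, 2), round(eta_keillor, 2))
```

The sign of t does not matter here because the value is squared before it is used.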

9.4.3 

Median = 12. Low Group: above 2, not above 9 (n = 11). High Group: above 
5, not above 7 (n = 12). Total above = 7, Total not above = 16. N = 23. 

$p = (2 + 5) \div 23 = .304$ 

$T = \dfrac{(2 \div 11) - (5 \div 12)}{\sqrt{.304(1 - .304)\left(\dfrac{1}{11} + \dfrac{1}{12}\right)}}$ 

T = 1.22 (in absolute value) 

$z_{crit}$ ($\alpha = .05$) = 1.96. We cannot reject $H_0$; there is no meaningful difference in the 
ratings of the two groups of Ss. 
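
A hedged Python sketch of the Median test arithmetic in 9.4.3 follows. It assumes the counts as reconciled with the totals (2 of 11 above the median in the low group, 5 of 12 in the high group); the variable names are illustrative.

```python
import math

above_low, n_low = 2, 11     # low group: Ss above the median, group size
above_high, n_high = 5, 12   # high group: Ss above the median, group size

# Overall proportion of Ss above the median.
p = (above_low + above_high) / (n_low + n_high)

# Difference between the two group proportions over its standard error.
numerator = above_low / n_low - above_high / n_high
denominator = math.sqrt(p * (1 - p) * (1 / n_low + 1 / n_high))
t_stat = numerator / denominator

print(round(p, 3), round(t_stat, 2))
```

The statistic comes out negative because the low group is entered first; its size, about 1.22, is what gets compared with the critical z of 1.96.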

9.5.1 

Score    Rank 

  12       1 
  13       2.5 
  13       2.5 
  14       4 
  15       5 
  16       6 
  17       7.5 
  17       7.5 
  18       9.5 
  18       9.5 
  19      11 
  20      12 

N = 12 
Top Rank = 12 
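
The tied scores in 9.5.1 share the average of the rank positions they occupy (for example, the two 13s occupy positions 2 and 3, so each receives 2.5). A small Python sketch of that rule, with an illustrative helper name, is:

```python
def average_ranks(values):
    """Rank from lowest to highest, giving tied values the mean of their positions."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

scores = [12, 13, 13, 14, 15, 16, 17, 17, 18, 18, 19, 20]
ranks = average_ranks(scores)

for score, rank in zip(scores, ranks):
    print(score, rank)
```

With ties averaged this way, the top rank still equals N and the ranks still sum to N(N + 1)/2.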


9.5.2 

Gp 2    Rank    Gp 1    Rank 

 18      9.0     22      3.0 
 13     14.5     20      6.5 
 16     10.5     21      4.5 
 15     12.0     19      8.0 
 14     13.0     16     10.5 
 21      4.5     24      1.0 
 20      6.5     23      2.0 
                 13     14.5 

T2 = 70    T1 = 50 
n2 = 7     n1 = 8 

$z = \dfrac{2(50) - 8(15 + 1)}{\sqrt{\dfrac{(8)(7)(15 + 1)}{3}}}$ 

z = -1.62 

We cannot reject $H_0$. There is no significant difference in cloze test scores for the 
two language groups. 
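
The z value in 9.5.2 can be retraced with the sketch below. The division by 3 under the square root is not legible in the key and is supplied here as an assumption; it reproduces the reported value of 1.62. The function name is illustrative.

```python
import math

def rank_sums_z(t1: float, n1: int, n2: int) -> float:
    """z approximation for the Rank sums test from the rank total of group 1."""
    n_total = n1 + n2
    return (2 * t1 - n1 * (n_total + 1)) / math.sqrt(n1 * n2 * (n_total + 1) / 3)

# Group 1: rank total 50 with 8 Ss; group 2 has 7 Ss.
z_cloze = rank_sums_z(50, 8, 7)

print(round(z_cloze, 2))
```

The negative sign only reflects which group's rank total is entered; the size of z (1.62) is below 1.96, so the null hypothesis is retained.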


9.6.1 (composition example) 

$\eta^2 = .38$ 

In this sample, 38% of the variability in composition ratings can be accounted for 
by language group. 

9.6.2 (listening lab example) 

$\eta^2 = \dfrac{2.80^2}{60}$ 

$\eta^2 = .13$ 

In this sample, 13% of the variability in ratings can be accounted for by group. 


Chapter 10 
10.1.2 

$s_D = \sqrt{\dfrac{84 - (1 \div 10)256}{10 - 1}}$ 

$s_D = 2.55$ 

$s_{\bar{D}} = \dfrac{2.55}{\sqrt{10}} = .806$ 

$t = \dfrac{5.7 - 4.1}{.806}$ 

t = 1.99 

$t_{crit} = 2.26$, df = 9 

No, we cannot reject $H_0$ because $t_{crit}$ is greater than $t_{obs}$. We have to conclude 
that the Ss who used the special materials did not make significantly greater gains 
than those who used the traditional materials. The results cannot be generalized 
because Ss were not randomly selected. 
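
The matched t-test steps in 10.1.2 can be checked from the summary figures alone. In the sketch below, 84 is taken to be the sum of the squared differences and 256 the square of the summed differences, as in the raw score formula; the variable names are illustrative assumptions.

```python
import math

n = 10
sum_d_squared = 84     # sum of squared difference scores
squared_sum_d = 256    # (sum of difference scores) squared
mean_1, mean_2 = 5.7, 4.1

# Standard deviation of the difference scores.
s_d = math.sqrt((sum_d_squared - squared_sum_d / n) / (n - 1))

# Standard error of the mean difference, then the observed t.
se_d = s_d / math.sqrt(n)
t_obs = (mean_1 - mean_2) / se_d
df = n - 1

print(round(s_d, 2), round(se_d, 3), round(t_obs, 2), df)
```

With 9 df the observed value of about 1.99 falls short of the critical 2.26, which is the basis for retaining the null hypothesis.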

10.2.1 (Guadalajara problem) 


$\eta^2 = \dfrac{2.99^2}{2.99^2 + 9} = .50$ 

In this sample, 50% of the variability in reading test scores can be accounted for 
by group. 

10.3.1 N = 36, $R_{crit}$ = 11, $R_{obs}$ = 13. 

Interpretation: We cannot reject $H_0$; Ss did not make significant gains in listening 
scores. 
10.4.2 

T = 17, N = 14 

z = 2.23 

$z_{crit}$ = 1.96, $\alpha$ = .05. Reject $H_0$. 

10.4.3 (reduced forms practice) 

In this sample, 38% of the variability in rate can be accounted for by condition. 
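
The formulas for 10.4.2 and 10.4.3 did not survive in this key, so the sketch below rests on stated assumptions: the garbled line is read as T = 17 and N = 14, z is taken from the usual normal approximation for the Wilcoxon Matched-pairs signed-ranks test, and eta squared is taken as z squared over N - 1. Under those assumptions the reported 2.23 and .38 are reproduced.

```python
import math

def wilcoxon_z(t: float, n: int) -> float:
    """Normal approximation for the matched-pairs signed-ranks T (assumed form)."""
    expected_t = n * (n + 1) / 4
    standard_error = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - expected_t) / standard_error

t_value, n_pairs = 17, 14          # assumed reading of the garbled line in 10.4.2
z_obs = wilcoxon_z(t_value, n_pairs)

# Strength of association, assuming eta squared = z squared / (N - 1).
eta2 = z_obs ** 2 / (n_pairs - 1)

print(round(abs(z_obs), 2), round(eta2, 2))
```

The sign of z is negative because T is the smaller rank total; only its size is compared with 1.96.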

Chapter 11 

11.1.1 

Repeated-measures 

11.1.2 

Between-groups 

11.1.3 

fixed 

11.3.3 

SST = 386.17 
SSB = 168.27 
SSW = 217.9 
MSB = 84.13 
MSW = 8.07 
F = 10.42 

$df_B$ = 2 

$df_W$ = 27 
$F_{crit}$ = 3.35 

Reject the null hypothesis and conclude there is a significant difference in final 
composition scores for subjects taught under three different methods. 
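
The chain from sums of squares to F in 11.3.3 can be sketched in Python as below. The group count (3) and total sample size (30) are not stated in the key; they are inferred here from the two df values and should be treated as assumptions, as should the variable names.

```python
ss_between = 168.27
ss_within = 217.9
ss_total = ss_between + ss_within     # 386.17

k_groups = 3                          # inferred from df between = 2
n_total = 30                          # inferred from df within = 27

df_between = k_groups - 1
df_within = n_total - k_groups

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between = ss_between / df_between  # about 84.13
ms_within = ss_within / df_within     # about 8.07
f_ratio = ms_between / ms_within      # about 10.4

print(round(ss_total, 2), round(ms_within, 2), round(f_ratio, 1))
```

Working from the rounded mean squares, as the key does, gives 10.42; full precision differs only in the last decimal place.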
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11.4.1 (methodology example) 


$\omega^2 = \dfrac{1056 - 4(95)}{7706 + 95}$ 

$\omega^2 = .09$ 

In this sample, 9% of the variability in exam scores can be accounted for by 
method. 

11.4.2 (composition problem) 

$\omega^2 = \dfrac{168.27 - 2(8.07)}{386.17 + 8.07}$ 

$\omega^2 = .39$ 

In this sample, 39% of the variability in composition scores can be accounted for 
by method. 
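
Both omega squared answers (11.4.1 and 11.4.2) use the same pattern: the between-groups sum of squares less (K - 1) times the within-groups mean square, over the total sum of squares plus the within-groups mean square. A Python sketch follows; the function name is illustrative, and the group count of 5 for the methodology example is inferred from the multiplier 4 in the key.

```python
def omega_squared(ss_between: float, ss_total: float, ms_within: float, k_groups: int) -> float:
    """Strength of association for a one-way ANOVA with k_groups levels."""
    return (ss_between - (k_groups - 1) * ms_within) / (ss_total + ms_within)

omega_methodology = omega_squared(1056, 7706, 95, 5)          # 11.4.1
omega_composition = omega_squared(168.27, 386.17, 8.07, 3)    # 11.4.2

print(round(omega_methodology, 2), round(omega_composition, 2))
```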

11.6.2 

Step 1: .011, Step 2: 9354.22, Step 3: 102.9, Step 4: 99, H = 3.90. 

df= 3, H crit = 7.81. Cannot reject H 0 ; conclude that students at different levels 
do not get different oral communication scores. 

11.8.1 

Level 1 vs. Level 2 

$z = \dfrac{2(77) - 10(21)}{\sqrt{\dfrac{10 \times 10 \times 21}{3}}}$

$z = -2.12$

d - 1 = 1, z crit = 2.24. Cannot reject H 0 ; no statistical difference between level 1 
and level 2. 

11.8.2 

Level 2 vs. Level 3: Z = 1.44, d - 1 = 1, Z crit = 2.24, no statistical difference. 

Level 2 vs. Level 4: Z = 1.96, d - 1 = 2, Z crit = 2.50, no statistical difference. 

Level 3 vs. Level 4: Z = 1.13, d - 1 = 1, Z crit = 2.24, no statistical difference. 
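
A pairwise comparison of this kind can be sketched as below. The sketch assumes the rank sums form of the z statistic, with N taken as the combined size of the two groups being compared and a divisor of 3 under the square root; that divisor reproduces the printed z of -2.12 but is an assumption here. The function name is illustrative.

```python
import math


def ryan_z(rank_sum_i, n_i, n_j):
    """z for one pairwise comparison following a Kruskal-Wallis test.

    rank_sum_i: sum of ranks for the first group in the pair
    n_i, n_j: sizes of the two groups being compared
    """
    n = n_i + n_j
    numerator = 2 * rank_sum_i - n_i * (n + 1)
    denominator = math.sqrt(n_i * n_j * (n + 1) / 3)
    return numerator / denominator


# Level 1 vs. Level 2 from answer 11.8.1: T = 77, ten Ss per level.
z = ryan_z(77, 10, 10)
print(f"z = {z:.2f}")
```

The absolute value of z is compared with the critical Z from Table 6 for the relevant d - 1.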
Chapter 12 
1.5 


One-way ANOVA with Repeated-Measures 

Source                 df    SS         MS        F
Types (A)               4    1615.11    403.78    20.35
Subjects (S)            8    2970.00
Types X Ss (A X S)     32     634.89     19.84
Total                  44    5220.00

F crit for (4, 32 df), α = .05 = 2.67. Reject H 0 and conclude that type of lesson had 
a significant effect on reading test scores. 
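
The entries of the repeated-measures table can be rebuilt from its sums of squares, as in this sketch. It assumes five lesson types and nine subjects (inferred from the degrees of freedom 4 and 8); names are illustrative.

```python
def repeated_measures_anova(ss_a, ss_s, ss_total, levels, subjects):
    """Return (SS_AxS, MS_A, MS_AxS, F) for a one-way repeated-measures ANOVA."""
    ss_as = ss_total - ss_a - ss_s          # interaction (error) sum of squares
    df_a = levels - 1
    df_as = df_a * (subjects - 1)
    ms_a = ss_a / df_a
    ms_as = ss_as / df_as
    return ss_as, ms_a, ms_as, ms_a / ms_as


ss_as, ms_a, ms_as, f_obs = repeated_measures_anova(
    ss_a=1615.11, ss_s=2970.00, ss_total=5220.00, levels=5, subjects=9
)
print(f"SS_AxS = {ss_as:.2f}, MS_A = {ms_a:.2f}, MS_AxS = {ms_as:.2f}, F = {f_obs:.2f}")
```
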


$\eta^2 = \dfrac{274.105}{351.64} = .78$

In this sample, 78% of the variability in definition scores can be accounted for 
by idiom type. 


$\dfrac{12}{3 \times 18 \times 4} = \dfrac{12}{216} = .055$

4042.5 x .055 = 224.58 
3(18)(4) = 216 
224.58 - 216 = 8.58 

Interpretation: error scores vary according to the feedback method. 

12.3.3 (computer practice) 

2 8.58 

" =^ = - 16 

In this sample, 16% of the variability in error scores can be accounted for by 
feedback type. 


1. $\chi^2_R = 8.58$

2. $\chi^2_{R\,crit} = 5.99$

3. $\dfrac{a(a + 1)}{6n} = \dfrac{3(3 + 1)}{6 \times 18} = \dfrac{12}{108} = .111$

4. $\sqrt{5.99 \times .111} = .82$

5. $\bar{X}_{T1} = 1.80$, $\bar{X}_{T2} = 1.64$, $\bar{X}_{T3} = 2.56$

              T2      T1      T3
              1.64    1.80    2.56
T2    1.64    -       .16     .92 *
T1    1.80            -       .76
T3    2.56                    -

The difference is between T2 and T3 (.92 > .82). 
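
The five steps above can be sketched in a few lines. The sketch assumes the critical difference is the square root of the critical chi-square times a(a + 1) / 6n, as in steps 3 and 4; the dictionary of mean ranks and the function name are illustrative.

```python
import math
from itertools import combinations


def critical_difference(chi2_crit, a, n):
    """Smallest difference in mean ranks that counts as significant."""
    return math.sqrt(chi2_crit * a * (a + 1) / (6 * n))


mean_ranks = {"T1": 1.80, "T2": 1.64, "T3": 2.56}
cd = critical_difference(chi2_crit=5.99, a=3, n=18)

# Keep only the pairs whose mean ranks differ by more than the critical difference.
significant = [
    (x, y)
    for x, y in combinations(mean_ranks, 2)
    if abs(mean_ranks[x] - mean_ranks[y]) > cd
]
print(f"critical difference = {cd:.2f}; significant pairs: {significant}")
```
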


Chapter 13 
13.1.1 


Total Variance 
    Variance Within 
    Variance Between 
        A: Method 
        B: Field (In)dependence 
        AB: Method x Field (In)dependence 


13.1.2. 


Total Variance 
    Variance Within 
    Variance Between 
        A: Student Status 
        B: Teacher Training 
        AB: Status x Tchr. Training 
[Figure: plot with separate lines for Class 1 and Class 2] 


e. There was a significant interaction between method and class. Ss in class 1 got 
higher composition scores than Ss in class 2 EXCEPT in method 2 where class 2 
Ss performed better. 


Chapter 14 
14.2.1 

H 0 : There is no relationship between problem solving type and grade on tasks. 
Row   Col   Obs    E        O-E      (O-E)^2   (O-E)^2 / E
1     1     11     10.38      .62      .384      .037
1     2      9     10.50    -1.50     2.25       .214
1     3     10      9.12      .88      .774      .085
2     1     12     11.77      .23      .053      .004
2     2     11     11.90     -.90      .810      .068
2     3     11     10.33      .67      .449      .043
3     1     35     34.27      .73      .533      .016
3     2     34     34.65     -.65      .422      .012
3     3     30     30.08     -.08      .006      .001
4     1     32     33.58    -1.58     2.50       .074
4     2     37     33.95     3.05     9.30       .274
4     3     28     29.47    -1.47     2.16       .073

$\chi^2 = .901$


The critical value of χ² for 6 df, p = .05, is 12.59. Therefore, the H 0 cannot be 
rejected. Conclude that there is no relation between problem solving type and 
grade on tasks. Findings cannot be generalized beyond this sample of Ss. 
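
The worked table for this answer can be reproduced with a short sketch that derives each expected frequency from the row and column totals. The observed counts are the ones listed in the answer; because the code does not round the expected values, its total differs from the hand calculation in the third decimal place.

```python
def chi_square(observed):
    """Return (chi-square, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Observed frequencies from answer 14.2.1 (4 rows x 3 columns).
observed = [
    [11, 9, 10],
    [12, 11, 11],
    [35, 34, 30],
    [32, 37, 28],
]
chi2, df = chi_square(observed)
print(f"chi-square = {chi2:.2f} with {df} df")
```
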


14.3.1 

$\chi^2 = \dfrac{100\,[\,|19 \times 17 - 31 \times 33| - 100 \div 2\,]^2}{52 \times 48 \times 50 \times 50}$

$\chi^2 = \dfrac{100\,[\,|323 - 1023| - 50\,]^2}{6240000}$

$\chi^2 = 6.77$

χ² crit = 3.84 for 1 df and α = .05. Reject the H 0 and conclude that there is a 
relationship between dropout rates and work hours. You can generalize to Ss in 
similar schools with similar characteristics and similar work habits. 
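
The Yates-corrected calculation for a 2 x 2 table can be sketched as follows. The cell labels a, b, c, d and their assignment (a = 19, b = 31, c = 33, d = 17) are assumptions chosen so that the marginal totals match the 52, 48, 50, 50 in the denominator above.

```python
def yates_chi_square(a, b, c, d):
    """Chi-square with Yates' correction for a 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


chi2 = yates_chi_square(19, 31, 33, 17)
print(f"chi-square = {chi2:.2f}")
```
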

14.4.2 

a. US Geographic Areas 

Vote          S-W    S-E    N-E    N-W    Cent 
pro 
against 
undecided 

b. Major       Korean    Farsi    Span    Japanese 
Humanities 
SocSci 
BioSci 
PhysSci 
Engr 

c. Age of Arrival 

Neg stage     < 10    10-19    20-29    > 30 
no + V 
unan. do 
aux neg 
analyz do 

d. 

Corrections/1000 wrds    F.Indep.    F.Dep. 
< 5 
5-10 
> 10 


14.4.3 

a. Yes, a Chi-square test is appropriate for the data. It is a 3 X 3 contingency 
table (3 language groups and 3 types of votes). Given that students of the entire 
community adult school system were surveyed, it seems likely that the expected 
cell frequency size would be adequate. The data are independent. 

b. Yes, a Chi-square test is appropriate. It is a 3 X 2 contingency table and since 
the records include a total school district, it seems likely that the sample size is 
large enough to get the needed expected cell frequencies. The data are inde¬ 
pendent. 

c. No, a Chi-square test is inappropriate for these data. The two samples come 
from the same children, so the data are not independent either within or between 
the cells. Collect oral language from one group and written from the other. 
Then, since the clauses don't form a logical set, decide which clauses you believe 
will differ across written and oral modes. Present a separate table for each clause. 
Since individual children might contribute more than others to the frequency 
counts in each cell, change the design so the data show how many Ss used low, 
middle, high numbers of each clause type in the two modes. 

d. No, a Chi-square test is inappropriate for these data. The two samples were 
obtained from the same teacher, so the data are not independent (either within 
or between the cells). Obtain data from two different classes. Then count the 
number of clauses (a rough measure of participation) each S produced in a 
-minute class period. Use this to establish levels (low, middle, high or < 5, 
10, etc.) and count the number of Ss who fall into these levels in each treatment. 

14.5.1 

$\phi = \sqrt{\dfrac{16.825}{916}} = .14$

14.5.2 


260 

.22 


Cramer's V = 



Cramer's V = .16 


14.6.1 

$z = \dfrac{28 - 10}{\sqrt{38}} = \dfrac{18}{\sqrt{38}} = 2.92$

Reject the null hypothesis and conclude that vote on issue is related to the seminar 
presentation. 
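
The z value in this answer has the form of a McNemar test on the two "changed" cells, which the sketch below assumes: the difference between the two change counts divided by the square root of their sum. The argument names are illustrative.

```python
import math


def mcnemar_z(changed_one_way, changed_other_way):
    """z for related samples: (B - C) / sqrt(B + C)."""
    total_changes = changed_one_way + changed_other_way
    return (changed_one_way - changed_other_way) / math.sqrt(total_changes)


# Counts from answer 14.6.1: 28 changes in one direction, 10 in the other.
z = mcnemar_z(28, 10)
print(f"z = {z:.2f}")
```
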


Chapter 15 

15.2.2 

r = .89 (z score formula) 

15.2.3 

Pearson correlation coefficients: Relevance and Informativeness = -.46, Relevance 
and Clarity = -.49, Relevance and Turn-taking = .15, Relevance and 
Turn-length = .57, Informativeness and Clarity = .90, Informativeness and 
Turn-taking = .23, Informativeness and Turn-length = -.44, Clarity and 
Turn-taking = .42, Clarity and Turn-length = -.30, Turn-taking and 
Turn-length = .48. 
15.4.1 

Written Error Detection 

$\dfrac{.75}{\sqrt{.8 \times .9}} = .88$

Grammar 

$\sqrt{.8 \times .95}$

The overlap of each of two types of grammar tests is high (77% for the global 
error detection test and 64% for the discrete point grammar test). If the intent 
behind the research question is to eliminate one or more tests, the decision would 
need to be made on the basis of practical substantive issues (e.g., difficulty of test 
construction, time needed for administration and correction, comprehensiveness 
of measures) in addition to statistical evidence. 
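
The correction for attenuation used in this answer can be sketched as below. It assumes the standard form, the observed correlation divided by the square root of the product of the two reliabilities; squaring the corrected value gives the percentage of overlap quoted in the text.

```python
import math


def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Correlation corrected for the unreliability of both measures."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


# Written error detection test: r = .75, reliabilities .8 and .9.
corrected = correct_for_attenuation(0.75, 0.8, 0.9)
overlap = corrected ** 2
print(f"corrected r = {corrected:.2f}, overlap = {overlap:.0%}")
```
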

15.6.2 


$r_{pbi} = \dfrac{80 - 70}{15}\sqrt{pq}$

$r_{pbi} = .33$


Yes, it is a good item since it falls in the .20 to .40 range. 

15.7.1 


$\rho = .50$

$\rho_{crit} = .591$; conclude that the two groups do not agree. 

15.8.1 


$W = \dfrac{906660 - 816750}{222750}$

$W = .404$

15.8.2 

$\chi^2 = 15(9)(.404)$

$\chi^2 = 54.54$

$\chi^2_{crit} = 16.90$, 9 df; reject the H 0 and conclude that there is a consensus regarding 
useful activities. 
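
The two steps in 15.8.1 and 15.8.2 can be sketched together. The sketch assumes the usual computational form of Kendall's W for m judges ranking n items, which reproduces the three numbers in the answer when m = 15 and n = 10. The sum of squared rank totals (75555) is back-calculated from 906660 / 12 and is an assumption, not a figure printed in the answer.

```python
def kendalls_w(sum_squared_rank_totals, judges, items):
    """Kendall's coefficient of concordance from the sum of squared rank totals."""
    numerator = 12 * sum_squared_rank_totals - 3 * judges ** 2 * items * (items + 1) ** 2
    denominator = judges ** 2 * items * (items ** 2 - 1)
    return numerator / denominator


def w_to_chi_square(w, judges, items):
    """Chi-square for testing W, with items - 1 degrees of freedom."""
    return judges * (items - 1) * w


w = kendalls_w(75555, judges=15, items=10)
chi2 = w_to_chi_square(w, judges=15, items=10)
print(f"W = {w:.3f}, chi-square = {chi2:.2f}")
```

Because the code carries W at full precision, its chi-square differs slightly from the hand value of 54.54, which uses W rounded to .404.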

15.9.1 


$\phi = \dfrac{(112 \times 68) - (82 \times 71)}{\sqrt{(82 + 112)(68 + 71)(82 + 68)(112 + 71)}}$

$\phi = .066$

$\chi^2 = 333(.066^2) = 1.45$

$\chi^2_{crit} = 3.84$

There is no relation between time of entrance to program and continuation of 
Spanish in high school. 
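
The phi coefficient and its chi-square conversion can be sketched as follows. The cell labels a, b, c, d are assumptions, assigned so that the products and sums match the figures in the answer (a = 82, b = 112, c = 68, d = 71).

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table [[a, b], [c, d]]: (bc - ad) over the root of the marginal products."""
    return (b * c - a * d) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def phi_to_chi_square(phi, n):
    """Chi-square equivalent of phi: N times phi squared, 1 df."""
    return n * phi ** 2


phi = phi_coefficient(82, 112, 68, 71)
chi2 = phi_to_chi_square(phi, n=82 + 112 + 68 + 71)
print(f"phi = {phi:.3f}, chi-square = {chi2:.2f}")
```
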

Chapter 16 
16.1.2 

Y = 540 + 8(25 - 30) 

Y = 500 

Difference in prediction = 40 points. 

16.2.1 

$s_{yx} = 40\sqrt{1 - .8^2}$

$s_{yx} = 24.00$

Interpret by making confidence intervals around predicted values. 
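
The two regression answers can be sketched in code. The sketch assumes that 540 is the mean of Y, 8 the slope, and 30 the mean of X in 16.1.2, and that 40 is the standard deviation of Y and .8 the correlation in 16.2.1; these readings are inferred from the form of the equations rather than stated in the answers.

```python
import math


def predict_y(mean_y, slope, x, mean_x):
    """Predicted Y from the regression line through the two means."""
    return mean_y + slope * (x - mean_x)


def standard_error_of_estimate(sd_y, r):
    """Standard error of estimate: s_y * sqrt(1 - r^2)."""
    return sd_y * math.sqrt(1 - r ** 2)


predicted = predict_y(mean_y=540, slope=8, x=25, mean_x=30)
see = standard_error_of_estimate(sd_y=40, r=0.8)
print(predicted, round(see, 2))
```
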

Chapter 17 
17.3.2 
Chapter 18 
18.3.1 

        1      2      3      4      5      6
1              .51    .65    .47    .83    .83
2                     .42    .62    .81    .85
3                            .63    .52   1.045
4                                   .59    .725
5                                          .775

Total = 10.275 

Average = $\dfrac{10.275}{15} = .685$

$\dfrac{6(.685)}{1 + 5(.685)} = \dfrac{4.11}{4.425} = .929$

Reconversion to Pearson correlation 

$r_{tt} = .73$

Conclude the interrater reliability is marginal. 
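
The interrater reliability calculation can be sketched as below. It assumes the fifteen table entries are Fisher Z values for each pair of the six raters, and it uses the hyperbolic tangent as the inverse of the Fisher Z transformation in place of looking the value up in Table 10.

```python
import math

# Fisher Z values for the 15 rater pairs, row by row from the table in 18.3.1.
z_values = [
    0.51, 0.65, 0.47, 0.83, 0.83,
    0.42, 0.62, 0.81, 0.85,
    0.63, 0.52, 1.045,
    0.59, 0.725,
    0.775,
]
n_raters = 6

average_z = sum(z_values) / len(z_values)
# Step the average up to the reliability of all raters combined.
combined_z = n_raters * average_z / (1 + (n_raters - 1) * average_z)
# Convert the Z value back to a Pearson correlation.
r_tt = math.tanh(combined_z)
print(f"average Z = {average_z:.3f}, combined Z = {combined_z:.3f}, r = {r_tt:.2f}")
```
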
Appendix C 
Statistical Tables 

Table 1: Normal Distribution 

Table of z Scores 

(A) = z score, (B) = area between mean and z, (C) = area beyond z 

(A)     (B)      (C)        (A)     (B)      (C)        (A)     (B)      (C)
0.00    .0000    .5000      0.55    .2088    .2912      1.10    .3643    .1357
0.01    .0040    .4960      0.56    .2123    .2877      1.11    .3665    .1335
0.02    .0080    .4920      0.57    .2157    .2843      1.12    .3686    .1314
0.03    .0120    .4880      0.58    .2190    .2810      1.13    .3708    .1292
0.04    .0160    .4840      0.59    .2224    .2776      1.14    .3729    .1271
0.05    .0199    .4801      0.60    .2257    .2743      1.15    .3749    .1251
0.06    .0239    .4761      0.61    .2291    .2709      1.16    .3770    .1230
0.07    .0279    .4721      0.62    .2324    .2676      1.17    .3790    .1210
0.08    .0319    .4681      0.63    .2357    .2643      1.18    .3810    .1190
0.09    .0359    .4641      0.64    .2389    .2611      1.19    .3830    .1170
0.10    .0398    .4602      0.65    .2422    .2578      1.20    .3849    .1151
0.11    .0438    .4562      0.66    .2454    .2546      1.21    .3869    .1131
0.12    .0478    .4522      0.67    .2486    .2514      1.22    .3888    .1112
0.13    .0517    .4483      0.68    .2517    .2483      1.23    .3907    .1093
0.14    .0557    .4443      0.69    .2549    .2451      1.24    .3925    .1075
0.15    .0596    .4404      0.70    .2580    .2420      1.25    .3944    .1056
0.16    .0636    .4364      0.71    .2611    .2389      1.26    .3962    .1038
0.17    .0675    .4325      0.72    .2642    .2358      1.27    .3980    .1020
0.18    .0714    .4286      0.73    .2673    .2327      1.28    .3997    .1003
0.19    .0753    .4247      0.74    .2704    .2296      1.29    .4015    .0985
0.20    .0793    .4207      0.75    .2734    .2266      1.30    .4032    .0968
0.21    .0832    .4168      0.76    .2764    .2236      1.31    .4049    .0951
0.22    .0871    .4129      0.77    .2794    .2206      1.32    .4066    .0934
0.23    .0910    .4090      0.78    .2823    .2177      1.33    .4082    .0918
0.24    .0948    .4052      0.79    .2852    .2148      1.34    .4099    .0901
0.25    .0987    .4013      0.80    .2881    .2119      1.35    .4115    .0885
0.26    .1026    .3974      0.81    .2910    .2090      1.36    .4131    .0869
0.27    .1064    .3936      0.82    .2939    .2061      1.37    .4147    .0853
0.28    .1103    .3897      0.83    .2967    .2033      1.38    .4162    .0838
0.29    .1141    .3859      0.84    .2995    .2005      1.39    .4177    .0823
0.30    .1179    .3821      0.85    .3023    .1977      1.40    .4192    .0808
0.31    .1217    .3783      0.86    .3051    .1949      1.41    .4207    .0793
0.32    .1255    .3745      0.87    .3078    .1922      1.42    .4222    .0778
0.33    .1293    .3707      0.88    .3106    .1894      1.43    .4236    .0764
0.34    .1331    .3669      0.89    .3133    .1867      1.44    .4251    .0749
0.35    .1368    .3632      0.90    .3159    .1841      1.45    .4265    .0735
0.36    .1406    .3594      0.91    .3186    .1814      1.46    .4279    .0721
0.37    .1443    .3557      0.92    .3212    .1788      1.47    .4292    .0708
0.38    .1480    .3520      0.93    .3238    .1762      1.48    .4306    .0694
0.39    .1517    .3483      0.94    .3264    .1736      1.49    .4319    .0681
0.40    .1554    .3446      0.95    .3289    .1711      1.50    .4332    .0668
0.41    .1591    .3409      0.96    .3315    .1685      1.51    .4345    .0655
0.42    .1628    .3372      0.97    .3340    .1660      1.52    .4357    .0643
0.43    .1664    .3336      0.98    .3365    .1635      1.53    .4370    .0630
0.44    .1700    .3300      0.99    .3389    .1611      1.54    .4382    .0618
0.45    .1736    .3264      1.00    .3413    .1587      1.55    .4394    .0606
0.46    .1772    .3228      1.01    .3438    .1562      1.56    .4406    .0594
0.47    .1808    .3192      1.02    .3461    .1539      1.57    .4418    .0582
0.48    .1844    .3156      1.03    .3485    .1515      1.58    .4429    .0571
0.49    .1879    .3121      1.04    .3508    .1492      1.59    .4441    .0559
0.50    .1915    .3085      1.05    .3531    .1469      1.60    .4452    .0548
0.51    .1950    .3050      1.06    .3554    .1446      1.61    .4463    .0537
0.52    .1985    .3015      1.07    .3577    .1423      1.62    .4474    .0526
0.53    .2019    .2981      1.08    .3599    .1401      1.63    .4484    .0516
0.54    .2054    .2946      1.09    .3621    .1379      1.64    .4495    .0505

(continued) 







Find your obtained z score value in column (A). The probability of this particular z score value appears in column 
(C). Column C gives the probability of obtaining a z score this high or this low in the distribution. To find the 
proportion of the curve below the z score, add .50 to the figure in column (B). The critical z values to remember in 
hypothesis testing are: z = 1.96 for p = .05 and z = 2.57 for p = .01 for two-tailed tests. The table is read in 
precisely the same way whether the obtained value of z is positive or negative. 
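
Columns (B) and (C) can also be computed directly rather than looked up. The sketch below uses the error function from Python's standard library; the function name is illustrative, and the sign of z is ignored because the table is read the same way for positive and negative values.

```python
import math


def z_table_row(z):
    """Return (area between mean and z, area beyond z) for a z score."""
    z = abs(z)
    between = 0.5 * math.erf(z / math.sqrt(2))
    beyond = 0.5 - between
    return round(between, 4), round(beyond, 4)


for z in (1.00, 1.64, 1.96):
    between, beyond = z_table_row(z)
    print(f"z = {z:.2f}: between = {between:.4f}, beyond = {beyond:.4f}")
```
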
Table 2: t-test 

Critical Values for t 

         p
df       .10      .05       .02       .01       .001
1        6.314    12.706    31.821    63.657    636.619
2        2.920     4.303     6.965     9.925     31.598
3        2.353     3.182     4.541     5.841     12.941
4        2.132     2.776     3.747     4.604      8.610
5        2.015     2.571     3.365     4.032      6.859
6        1.943     2.447     3.143     3.707      5.959
7        1.895     2.365     2.998     3.499      5.405
8        1.860     2.306     2.896     3.355      5.041
9        1.833     2.262     2.821     3.250      4.781
10       1.812     2.228     2.764     3.169      4.587
11       1.796     2.201     2.718     3.106      4.437
12       1.782     2.179     2.681     3.055      4.318
13       1.771     2.160     2.650     3.012      4.221
14       1.761     2.145     2.624     2.977      4.140
15       1.753     2.131     2.602     2.947      4.073
16       1.746     2.120     2.583     2.921      4.015
17       1.740     2.110     2.567     2.898      3.965
18       1.734     2.101     2.552     2.878      3.922
19       1.729     2.093     2.539     2.861      3.883
20       1.725     2.086     2.528     2.845      3.850
21       1.721     2.080     2.518     2.831      3.819
22       1.717     2.074     2.508     2.819      3.792
23       1.714     2.069     2.500     2.807      3.767
24       1.711     2.064     2.492     2.797      3.745
25       1.708     2.060     2.485     2.787      3.725
26       1.706     2.056     2.479     2.779      3.707
27       1.703     2.052     2.473     2.771      3.690
28       1.701     2.048     2.467     2.763      3.674
29       1.699     2.045     2.462     2.756      3.659
30       1.697     2.042     2.457     2.750      3.646
40       1.684     2.021     2.423     2.704      3.551
60       1.671     2.000     2.390     2.660      3.460
120      1.658     1.980     2.358     2.617      3.373
Find the row for your df in the first column. Find the column with the appropriate 
p level. The intersection of this row and column shows the value needed 
to reject the H 0 . The observed level of t must be equal to or greater than this 
critical value. The critical values are the same for both positive and negative t 
values. Notice that not all df are given after 30. If the df you need is not in the 
table, use the row before as a conservative estimate of the critical value needed 
to reject the null hypothesis. Thus, if the df is 32, use the critical value for 30 df 
as a rough estimate. 
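
The "use the row before" rule can be sketched as a lookup. The dictionary below holds only a few rows of the p = .05 column of Table 2, copied for illustration; the function and variable names are hypothetical.

```python
import bisect

# A subset of the p = .05 column of Table 2 (df -> critical t).
T_CRIT_05 = {20: 2.086, 25: 2.060, 30: 2.042, 40: 2.021, 60: 2.000, 120: 1.980}


def conservative_t_crit(df, table):
    """Critical t for the largest tabled df that does not exceed the df needed."""
    tabled_dfs = sorted(table)
    index = bisect.bisect_right(tabled_dfs, df) - 1
    if index < 0:
        raise ValueError("df is smaller than every df in the table")
    return table[tabled_dfs[index]]


def rejects_null(t_obs, df, table):
    """True when |t| equals or exceeds the (conservative) critical value."""
    return abs(t_obs) >= conservative_t_crit(df, table)


print(conservative_t_crit(32, T_CRIT_05))   # falls back to the row for 30 df
print(rejects_null(-2.10, 32, T_CRIT_05))
```
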
Table 3: Sign Test 

Critical Values of R for the Sign Test 

        p
N       .05    .01
1
2
3
4
5
6        0
7        0
8        0      0
9        1      0
10       1      0
11       1      0
12       2      1
13       2      1
14       2      1
15       3      2
16       3      2
17       4      2
18       4      3
19       4      3
20       5      3
21       5      4
22       5      4
23       6      4
24       6      5
25       7      5
26       7      6
27       7      6
28       8      6
29       8      7
30       9      7
35      11      9
40      13     11
45      15     13
50      17     15

Find the appropriate row for the number of pairs in the first column. Find the 
column for the selected level of probability. Check the intersection to find the 
number of changes required to reject the H 0 . The observed number of changes 
must be less than the critical number of changes. 
Table 4: Wilcoxon Matched-Pairs Signed-Ranks Test 

Critical Values of T for Wilcoxon Matched-Pairs Signed-Ranks 

        p
N       .05    .025    .01
6        0      -       -
7        2      0       -
8        4      2       0
9        6      3       2
10       8      5       3
11      11      7       5
12      14     10       7
13      17     13      10
14      21     16      13
15      25     20      16
16      30     24      20
17      35     28      23
18      40     33      28
19      46     38      32
20      52     43      38
21      59     49      43
22      66     56      49
23      73     62      55
24      81     69      61
25      89     77      68

Find the appropriate row and the column for the selected level of probability. 
Check the intersection for the sum of ranks required to reject the H 0 . The observed 
sum of ranks must be equal to or less than the critical sum of ranks given 
in the intersection. 
Table 5: F distribution (ANOVA) 

Critical Values of F 

df      1        2        3        4        5        6        7        8        9        10
1       161      200      216      225      230      234      237      239      241      242
        4052     4999     5403     5625     5764     5859     5928     5981     6022     6056
2       18.51    19.00    19.16    19.25    19.30    19.33    19.36    19.37    19.38    19.39
        98.49    99.01    99.17    99.25    99.30    99.33    99.34    99.36    99.38    99.40
3       10.13    9.55     9.28     9.12     9.01     8.94     8.88     8.84     8.81     8.78
        34.12    30.81    29.46    28.71    28.24    27.91    27.67    27.49    27.34    27.23
4       7.71     6.94     6.59     6.39     6.26     6.16     6.09     6.04     6.00     5.96
        21.20    18.00    16.69    15.98    15.52    15.21    14.98    14.80    14.66    14.54
5       6.61     5.79     5.41     5.19     5.05     4.95     4.88     4.82     4.78     4.74
        16.26    13.27    12.06    11.39    10.97    10.67    10.45    10.27    10.15    10.05
6       5.99     5.14     4.76     4.53     4.39     4.28     4.21     4.15     4.10     4.06
        13.74    10.92    9.78     9.15     8.75     8.47     8.26     8.10     7.98     7.87
7       5.59     4.74     4.35     4.12     3.97     3.87     3.79     3.73     3.68     3.63
        12.25    9.55     8.45     7.85     7.46     7.19     7.00     6.84     6.71     6.62
8       5.32     4.46     4.07     3.84     3.69     3.58     3.50     3.44     3.39     3.34
        11.26    8.65     7.59     7.01     6.63     6.37     6.19     6.03     5.91     5.82
9       5.12     4.26     3.86     3.63     3.48     3.37     3.29     3.23     3.18     3.13
        10.56    8.02     6.99     6.42     6.06     5.80     5.62     5.47     5.35     5.26
10      4.96     4.10     3.71     3.48     3.33     3.22     3.14     3.07     3.02     2.97
        10.04    7.56     6.55     5.99     5.64     5.39     5.21     5.06     4.95     4.85
11      4.84     3.98     3.59     3.36     3.20     3.09     3.01     2.95     2.90     2.86
        9.65     7.20     6.22     5.67     5.32     5.07     4.88     4.74     4.63     4.54
12      4.75     3.88     3.49     3.26     3.11     3.00     2.92     2.85     2.80     2.76
        9.33     6.93     5.95     5.41     5.06     4.82     4.65     4.50     4.39     4.30
13      4.67     3.80     3.41     3.18     3.02     2.92     2.84     2.77     2.72     2.67
        9.07     6.70     5.74     5.20     4.86     4.62     4.44     4.30     4.19     4.10
14      4.60     3.74     3.34     3.11     2.96     2.85     2.77     2.70     2.65     2.60
        8.86     6.51     5.56     5.03     4.69     4.46     4.28     4.14     4.03     3.94
15      4.54     3.68     3.29     3.06     2.90     2.79     2.70     2.64     2.59     2.55
        8.68     6.36     5.42     4.89     4.56     4.32     4.14     4.00     3.89     3.80
df      1        2        3        4        5        6        7        8        9        10
16      4.49     3.63     3.24     3.01     2.85     2.74     2.66     2.59     2.54     2.49
        8.53     6.23     5.29     4.77     4.44     4.20     4.03     3.89     3.78     3.69
17      4.45     3.59     3.20     2.96     2.81     2.70     2.62     2.55     2.50     2.45
        8.40     6.11     5.18     4.67     4.34     4.10     3.93     3.79     3.68     3.59
18      4.41     3.55     3.16     2.93     2.77     2.66     2.58     2.51     2.46     2.41
        8.28     6.01     5.09     4.58     4.25     4.01     3.85     3.71     3.60     3.51
19      4.38     3.52     3.13     2.90     2.74     2.63     2.55     2.48     2.43     2.38
        8.18     5.93     5.01     4.50     4.17     3.94     3.77     3.63     3.52     3.43
20      4.35     3.49     3.10     2.87     2.71     2.60     2.52     2.45     2.40     2.35
        8.10     5.85     4.94     4.43     4.10     3.87     3.71     3.56     3.45     3.37
21      4.32     3.47     3.07     2.84     2.68     2.57     2.49     2.42     2.37     2.32
        8.02     5.78     4.87     4.37     4.04     3.81     3.65     3.51     3.40     3.31
22      4.30     3.44     3.05     2.82     2.66     2.55     2.47     2.40     2.35     2.30
        7.94     5.72     4.82     4.31     3.99     3.76     3.59     3.45     3.35     3.26
23      4.28     3.42     3.03     2.80     2.64     2.53     2.45     2.38     2.32     2.28
        7.88     5.66     4.76     4.26     3.94     3.71     3.54     3.41     3.30     3.21
24      4.26     3.40     3.01     2.78     2.62     2.51     2.43     2.36     2.30     2.26
        7.82     5.61     4.72     4.22     3.90     3.67     3.50     3.36     3.25     3.17
25      4.24     3.38     2.99     2.76     2.60     2.49     2.41     2.34     2.28     2.24
        7.77     5.57     4.68     4.18     3.86     3.63     3.46     3.32     3.21     3.13
26      4.22     3.37     2.98     2.74     2.59     2.47     2.39     2.32     2.27     2.22
        7.72     5.53     4.64     4.14     3.82     3.59     3.42     3.29     3.17     3.09
27      4.21     3.35     2.96     2.73     2.57     2.46     2.37     2.30     2.25     2.20
        7.68     5.49     4.60     4.11     3.79     3.56     3.39     3.26     3.14     3.06
28      4.20     3.34     2.95     2.71     2.56     2.44     2.36     2.29     2.24     2.19
        7.64     5.45     4.57     4.07     3.76     3.53     3.36     3.23     3.11     3.03
29      4.18     3.33     2.93     2.70     2.54     2.43     2.35     2.28     2.22     2.18
        7.60     5.42     4.54     4.04     3.73     3.50     3.32     3.20     3.08     3.00
30      4.17     3.32     2.92     2.69     2.53     2.42     2.34     2.27     2.21     2.16
        7.56     5.39     4.51     4.02     3.70     3.47     3.30     3.17     3.06     2.98
32      4.15     3.30     2.90     2.67     2.51     2.40     2.32     2.25     2.19     2.14
        7.50     5.34     4.46     3.97     3.66     3.42     3.25     3.12     3.01     2.94
df      1        2        3        4        5        6        7        8        9        10
34      4.13     3.28     2.88     2.65     2.49     2.38     2.30     2.23     2.17     2.12
        7.44     5.29     4.42     3.93     3.61     3.38     3.21     3.08     2.97     2.89
36      4.11     3.26     2.86     2.63     2.48     2.36     2.28     2.21     2.15     2.10
        7.39     5.25     4.38     3.89     3.58     3.35     3.18     3.04     2.94     2.86
38      4.10     3.25     2.85     2.62     2.46     2.35     2.26     2.19     2.14     2.09
        7.35     5.21     4.34     3.86     3.54     3.32     3.15     3.02     2.91     2.82
40      4.08     3.23     2.84     2.61     2.45     2.34     2.25     2.18     2.12     2.07
        7.31     5.18     4.31     3.83     3.51     3.29     3.12     2.99     2.88     2.80
42      4.07     3.22     2.83     2.59     2.44     2.32     2.24     2.17     2.11     2.06
        7.27     5.15     4.29     3.80     3.49     3.26     3.10     2.96     2.86     2.77
44      4.06     3.21     2.82     2.58     2.43     2.31     2.23     2.16     2.10     2.05
        7.24     5.12     4.26     3.78     3.46     3.24     3.07     2.94     2.84     2.75
46      4.05     3.20     2.81     2.57     2.42     2.30     2.22     2.14     2.09     2.04
        7.21     5.10     4.24     3.76     3.44     3.22     3.05     2.92     2.82     2.73
48      4.04     3.19     2.80     2.56     2.41     2.30     2.21     2.14     2.08     2.03
        7.19     5.08     4.22     3.74     3.42     3.20     3.04     2.90     2.80     2.71
50      4.03     3.18     2.79     2.56     2.40     2.29     2.20     2.13     2.07     2.02
        7.17     5.06     4.20     3.72     3.41     3.18     3.02     2.88     2.78     2.70
55      4.02     3.17     2.78     2.54     2.38     2.27     2.18     2.11     2.05     2.00
        7.12     5.01     4.16     3.68     3.37     3.15     2.98     2.85     2.75     2.66
60      4.00     3.15     2.76     2.52     2.37     2.25     2.17     2.10     2.04     1.99
        7.08     4.98     4.13     3.65     3.34     3.12     2.95     2.82     2.72     2.63
65      3.99     3.14     2.75     2.51     2.36     2.24     2.15     2.08     2.02     1.98
        7.04     4.95     4.10     3.62     3.31     3.09     2.93     2.79     2.70     2.61
70      3.98     3.13     2.74     2.50     2.35     2.23     2.14     2.07     2.01     1.97
        7.01     4.92     4.08     3.60     3.29     3.07     2.91     2.77     2.67     2.59
80      3.96     3.11     2.72     2.48     2.33     2.21     2.12     2.05     1.99     1.95
        6.96     4.88     4.04     3.56     3.25     3.04     2.87     2.74     2.64     2.55
100     3.94     3.09     2.70     2.46     2.30     2.19     2.10     2.03     1.97     1.92
        6.90     4.82     3.98     3.51     3.20     2.99     2.82     2.69     2.59     2.51

Down the side of the table, find the df for the degrees of freedom within 
(N - K). Then find the df across the top of the table for df between (K - 1). In 
the intersection of the row and the column, you will find the F crit needed to reject 
the H 0 . The upper value is for p = .05, the lower value is for p = .01. If the observed 
value of F is equal to or larger than the value shown in the table, you can 
reject the H 0 . 


Table 6: Ryan's Procedure for Kruskal-Wallis 

        d - 1
a       7       6       5       4       3       2       1
8       3.13    3.09    3.03    2.96    2.86    2.74    2.50
7               3.04    2.99    2.92    2.83    2.69    2.45
6                       2.94    2.87    2.77    2.63    2.40
5                               2.81    2.72    2.58    2.33
4                                       2.64    2.50    2.24
3                                               2.40    2.13

In the above table, a is the number of groups, d is the number of levels spanned 
by the comparison. For example, if you have four groups, you will look along the 
row labeled 4. Then, if you want to compare group 1 with group 4, that spans 
four levels (which is d). d - 1 = 3, so you look down the column labeled 3 for the 
intersection (2.64). This is the critical Z value. 
Table 7: Chi-square Distribution 

Table of Critical Values for Chi-square (χ²) 

        p
df      .10        .05        .025       .01        .001
1       2.706      3.841      5.024      6.635      10.828
2       4.605      5.991      7.378      9.210      13.816
3       6.251      7.815      9.348      11.345     16.266
4       7.779      9.488      11.143     13.277     18.467
5       9.236      11.070     12.832     15.086     20.515
6       10.645     12.592     14.449     16.812     22.458
7       12.017     14.067     16.013     18.475     24.322
8       13.362     15.507     17.535     20.090     26.125
9       14.684     16.919     19.023     21.666     27.877
10      15.987     18.307     20.483     23.209     29.588
11      17.275     19.675     21.920     24.725     31.264
12      18.549     21.026     23.337     26.217     32.909
13      19.812     22.362     24.736     27.688     34.528
14      21.064     23.685     26.119     29.141     36.123
15      22.307     24.996     27.488     30.578     37.697
16      23.542     26.296     28.845     32.000     39.252
17      24.769     27.587     30.191     33.409     40.790
18      25.989     28.869     31.526     34.805     42.312
19      27.204     30.144     32.852     36.191     43.820
20      28.412     31.410     34.170     37.566     45.315
25      34.382     37.653     40.647     44.314     52.620
30      40.256     43.773     46.979     50.892     59.703
40      51.805     55.759     59.342     63.691     73.402
50      63.167     67.505     71.420     76.154     86.661
60      74.397     79.082     83.298     88.379     99.607
70      85.527     90.531     95.023     100.425    112.317
80      96.578     101.879    106.629    112.329    124.839
90      107.565    113.145    118.136    124.116    137.208
100     118.498    124.342    129.561    135.807    149.449

Look for the appropriate degrees of freedom down the side of the table. Then 
locate the column for the probability level you selected to test the H 0 . Check the 
intersection of this row and column for the critical value. To reject the H 0 , the 
observed χ² value must be equal to or greater than the critical value shown in the 
table. 





Table 8: Pearson Product-Moment Correlation 

Critical Values for Pearson Product-Moment Correlation 

            p
df = N - 2  .05      .01      .001
1           .9969    .9999    1.000
2           .9500    .9900    .9990
3           .8783    .9587    .9912
4           .8114    .9172    .9741
5           .7545    .8745    .9507
6           .7067    .8343    .9249
7           .6664    .7977    .8982
8           .6319    .7646    .8721
9           .6021    .7348    .8471
10          .5760    .7079    .8233
11          .5529    .6835    .8010
12          .5324    .6614    .7800
13          .5139    .6411    .7603
14          .4973    .6226    .7420
15          .4821    .6055    .7246
16          .4683    .5897    .7084
17          .4555    .5751    .6932
18          .4438    .5614    .6787
19          .4329    .5487    .6652
20          .4227    .5368    .6524
25          .3809    .4869    .5974
30          .3494    .4487    .5541
35          .3246    .4182    .5189
40          .3044    .3932    .4896
45          .2875    .3721    .4648
50          .2732    .3541    .4433
60          .2500    .3248    .4078
70          .2319    .3017    .3799
80          .2172    .2830    .3568
90          .2050    .2673    .3375
100         .1946    .2540    .3211


First find the degrees of freedom for your study (number of pairs minus 2). Then 
find the column for the level of significance you have selected. The observed 
value of r must be greater than or equal to the value in the intersection of this 
column and line. The critical values are the same for positive and negative 
correlations. 
Table 9: Spearman Rho Correlation 

Critical Values of Spearman rho Rank-Order Correlation 

        p
N       .05      .01
5       1.000    -
6       .886     1.000
7       .786     .929
8       .738     .881
9       .683     .833
10      .648     .794
12      .591     .777
14      .544     .714
16      .506     .665
18      .475     .625
20      .450     .591
22      .428     .562
24      .409     .537
26      .392     .515
28      .377     .496
30      .364     .478


N = Number of pairs 

Columns labeled .05 and .01 = the probability level 

To reject H 0 , the attained value of rho must be greater than or equal to the tabled 
value. 
Table 10: Fisher Z-Transformation of Correlations 

r       Z          r       Z          r       Z
.00     .000       .34     .354       .67     .811
.01     .010       .35     .365       .68     .829
.02     .020       .36     .377       .69     .848
.03     .030       .37     .388       .70     .867
.04     .040       .38     .400       .71     .887
.05     .050       .39     .412       .72     .908
.06     .060       .40     .424       .73     .929
.07     .070       .41     .436       .74     .950
.08     .080       .42     .448       .75     .973
.09     .090       .43     .460       .76     .996
.10     .100       .44     .472       .77     1.020
.11     .110       .45     .485       .78     1.045
.12     .121       .46     .497       .79     1.071
.13     .131       .47     .510       .80     1.099
.14     .141       .48     .523       .81     1.127
.15     .151       .49     .536       .82     1.157
.16     .161       .50     .549       .83     1.188
.17     .172       .51     .563       .84     1.221
.18     .182       .52     .576       .85     1.256
.19     .192       .53     .590       .86     1.293
.20     .203       .54     .604       .87     1.333
.21     .213       .55     .618       .88     1.376
.22     .224       .56     .633       .89     1.422
.23     .234       .57     .648       .90     1.472
.24     .245       .58     .662       .91     1.528
.25     .255       .59     .678       .92     1.589
.26     .266       .60     .693       .93     1.658
.27     .277       .61     .709       .94     1.738
.28     .288       .62     .725       .95     1.832
.29     .299       .63     .741       .96     1.946
.30     .310       .64     .758       .97     2.092
.31     .321       .65     .775       .98     2.298
.32     .332       .66     .793       .99     2.647
.33     .343                          1.00    11.859


Find the value of the uncorrected correlation in the columns labeled r. Correct 
the correlation to the Z value given in the column to the right of the r value. 
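
For correlations below 1.00, the Z values in this table match the inverse hyperbolic tangent of r, so they can be computed instead of looked up. The sketch below uses only Python's standard library; the function names are illustrative.

```python
import math


def fisher_z(r):
    """Fisher Z transformation of a correlation (valid for -1 < r < 1)."""
    return math.atanh(r)


def z_to_r(z):
    """Convert a Fisher Z value back to a correlation."""
    return math.tanh(z)


for r in (0.50, 0.73, 0.90):
    print(f"r = {r:.2f} -> Z = {fisher_z(r):.3f}")
```
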
Appendix D 
List of Formulas 


Chapter 5 

Descriptive statistics 

Cumulative frequency = successive addition of frequency 

Proportion = $\dfrac{\#\text{ of } X}{\text{total}}$

Percentage = 100 x proportion 
Cumulative percent = successive additions of percent 
Rate = relative frequency per unit 

Ratio = $\dfrac{\#\text{ of } X}{\#\text{ of } Y}$


Chapter 6 
Central tendency 

Mode = most frequently obtained score 
Median = center of distribution 

Mean = $\bar{X} = \dfrac{\sum X}{N}$

Formulas for variability 

Range = $X_{highest} - X_{lowest}$

Variance = $s^2 = \dfrac{\sum (X - \bar{X})^2}{N - 1}$


Variance — 


N- 1 
Chapter 7 

Percentile = $(100)\dfrac{F\text{ below the score} + \tfrac{1}{2}F\text{ same score}}{N}$

T score = (10)(z) + 50 

Guttman formulas 

Coefficient of reproducibility = $C_{rep} = 1 - \dfrac{\text{number of errors}}{(\#\,Ss)(\#\,\text{items})}$

Minimum marginal reproducibility = $MM_{rep} = \dfrac{\text{maximum marginals}}{(\#\,Ss)(\#\,\text{items})}$

% improvement in reproducibility = $C_{rep} - MM_{rep}$

Coefficient of scalability = $\dfrac{\%\text{ improvement in reproducibility}}{1 - MM_{rep}}$


Chapter 9 

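Standard error of means

$s_{\bar{X}} = \dfrac{s}{\sqrt{N}}$

Case 1 t formula

$t_{obs} = \dfrac{\bar{X} - \mu}{s_{\bar{X}}}$

Case 2 t formula

$t_{obs} = \dfrac{\bar{X}_e - \bar{X}_c}{s_{(\bar{X}_e - \bar{X}_c)}}$

Standard error of differences between $\bar{X}$'s

$s_{(\bar{X}_e - \bar{X}_c)} = \sqrt{\dfrac{s_e^2}{n_e} + \dfrac{s_c^2}{n_c}}$

Median test

$T = \dfrac{(A \div n_1) - (B \div n_2)}{\sqrt{\hat{p}(1 - \hat{p})(1 \div n_1 + 1 \div n_2)}}$
where $\hat{p} = (A + B) \div N$

Rank sums test

$z = \dfrac{2T_1 - n_1(N + 1)}{\sqrt{\dfrac{(n_1)(n_2)(N + 1)}{3}}}$

eta² for t-test

$\eta^2 = \dfrac{t^2}{t^2 + df}$

eta² for Rank sums

$\eta^2 = \dfrac{z^2}{N - 1}$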
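
To see how the Rank sums z and the two eta² formulas fit together, here is a short sketch. It assumes z = [2T1 - n1(N + 1)] / sqrt[(n1)(n2)(N + 1) / 3], eta² = z² / (N - 1) for Rank sums, and eta² = t² / (t² + df) for the t-test. The function names and the data (two groups of 5, with ranks in the first group summing to 20) are made up for illustration.

```python
import math

def rank_sums_z(t1, n1, n2):
    """z for the Rank sums test; t1 is the sum of ranks in group 1."""
    n_total = n1 + n2
    return (2 * t1 - n1 * (n_total + 1)) / math.sqrt(n1 * n2 * (n_total + 1) / 3)

def eta_squared_from_z(z, n_total):
    """Strength of association for Rank sums: z^2 / (N - 1)."""
    return z ** 2 / (n_total - 1)

def eta_squared_from_t(t, df):
    """Strength of association for the t-test: t^2 / (t^2 + df)."""
    return t ** 2 / (t ** 2 + df)

z = rank_sums_z(t1=20, n1=5, n2=5)        # hypothetical rank sum
eta_z = eta_squared_from_z(z, n_total=10)
eta_t = eta_squared_from_t(t=2.0, df=28)  # hypothetical t value
print(round(z, 3), round(eta_z, 3), round(eta_t, 3))
```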


Chapter 10 
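Matched (paired) t

$t_{matched} = \dfrac{\bar{X}_1 - \bar{X}_2}{s_{\bar{D}}}$

Standard error of difference between $\bar{X}$s

$s_{\bar{D}} = \dfrac{s_D}{\sqrt{n}}$

Standard deviation of differences

$s_D = \sqrt{\dfrac{\sum D^2 - \frac{1}{n}(\sum D)^2}{n - 1}}$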


Wilcoxon Matched-pairs

$z = \dfrac{T - \dfrac{N(N + 1)}{4}}{\sqrt{\dfrac{N(N + 1)(2N + 1)}{24}}}$
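
The Wilcoxon Matched-pairs z is easy to mis-key on a calculator because of the nested fractions. The sketch below assumes z = [T - N(N + 1)/4] / sqrt[N(N + 1)(2N + 1)/24], where T is the smaller sum of signed ranks and N is the number of pairs that changed. The function name and the values T = 8, N = 10 are invented for illustration.

```python
import math

def wilcoxon_z(t, n):
    """z for the Wilcoxon Matched-pairs test."""
    expected_t = n * (n + 1) / 4                        # N(N + 1) / 4
    spread = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # sqrt of N(N + 1)(2N + 1) / 24
    return (t - expected_t) / spread

z = wilcoxon_z(t=8, n=10)   # hypothetical T and N
print(round(z, 3))
```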


Chapter 11

ANOVA F-ratio

$F_{obs} = \dfrac{s^2_{between}}{s^2_{within}} = \dfrac{MSB}{MSW}$

Sum of squares total

$SST = \sum X^2 - \dfrac{(\sum X)^2}{N} = \sum (X - \bar{X}_G)^2$

Sum of squares between

$SSB = \left[\dfrac{(\sum X_1)^2}{n_1} + \dfrac{(\sum X_2)^2}{n_2} + \cdots + \dfrac{(\sum X_k)^2}{n_k}\right] - \dfrac{(\sum X)^2}{N} = \sum n(\bar{X} - \bar{X}_G)^2$

Sum of squares within

$SSW = SST - SSB$

Mean squares between

$MSB = \dfrac{SSB}{df_B} = \dfrac{SSB}{K - 1}$

Mean squares within

$MSW = \dfrac{SSW}{df_W} = \dfrac{SSW}{N - K}$

eta² for unbalanced or repeated-measures designs

$\eta^2 = \dfrac{SSB}{SST}$

omega² for balanced designs, between-groups

$\omega^2 = \dfrac{SSB - (K - 1)MSW}{SST + MSW}$
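
The one-way ANOVA quantities are computed in a fixed order (SST, SSB, SSW, then the mean squares and F), and a small program makes that order explicit. This is a sketch only, assuming the raw-score formulas SST = ΣX² - (ΣX)²/N and SSB = Σ[(ΣX group)²/n group] - (ΣX)²/N, with eta² = SSB/SST and omega² = [SSB - (K - 1)MSW] / (SST + MSW). The function name and the three small groups of scores are made up.

```python
def one_way_anova(groups):
    """Return the sums of squares, mean squares, F, eta squared and omega squared."""
    scores = [x for group in groups for x in group]
    n_total = len(scores)                        # N
    k = len(groups)                              # K, the number of groups
    correction = sum(scores) ** 2 / n_total      # (sum of X)^2 / N
    sst = sum(x * x for x in scores) - correction
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ssw = sst - ssb
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return {
        "SST": sst, "SSB": ssb, "SSW": ssw,
        "MSB": msb, "MSW": msw, "F": msb / msw,
        "eta2": ssb / sst,
        "omega2": (ssb - (k - 1) * msw) / (sst + msw),
    }

# Hypothetical scores for three groups of three Ss each
result = one_way_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result)
```

For these invented scores SST = 60, SSB = 54, and SSW = 6, so F = 27 with 2 and 6 degrees of freedom.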


Kruskal-Wallis test

$H = \dfrac{12}{N(N + 1)} \sum_{A=1}^{a} \dfrac{T_A^2}{n_A} - 3(N + 1)$

eta² for Kruskal-Wallis

$\eta^2 = \dfrac{H}{N - 1}$

Ryan's Procedure for Kruskal-Wallis 

27}- nf,N+ 1) 






Chapter 12 

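Repeated-measures ANOVA formulas

$G = T_{A_1} + T_{A_2} + T_{A_3} + \cdots + T_{A_n}$

$SST = \sum X^2 - \dfrac{G^2}{N}$

$SS_A = \sum_{A=1}^{a} \dfrac{T_A^2}{n_A} - \dfrac{G^2}{N}$

$SS_S = \sum_{S=1}^{s} \dfrac{T_S^2}{n_S} - \dfrac{G^2}{N}$

$SS_{AS} = SST - SS_A - SS_S$

$F = \dfrac{MS_A}{MS_{AS}}$

Friedman

$\chi^2_R = \dfrac{12}{(a)(s)(a + 1)} \sum T_A^2 - 3(s)(a + 1)$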


Chapter 13 

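Factorial ANOVA formulas

$F_{Factor A} = \dfrac{s^2_A}{s^2_W} = \dfrac{MS_A}{MSW}$

$F_{Factor B} = \dfrac{s^2_B}{s^2_W} = \dfrac{MS_B}{MSW}$

$F_{AB} = \dfrac{s^2_{AB}}{s^2_W} = \dfrac{MS_{AB}}{MSW}$

$SST = \sum X^2 - \dfrac{(\sum X)^2}{N}$

$SSB = \dfrac{(\sum X_1)^2}{n_1} + \dfrac{(\sum X_2)^2}{n_2} + \dfrac{(\sum X_3)^2}{n_3} + \dfrac{(\sum X_4)^2}{n_4} - \dfrac{(\sum X)^2}{N}$

$SS_A = \dfrac{(\sum \text{scores } A_1)^2}{n_{A1}} + \dfrac{(\sum \text{scores } A_2)^2}{n_{A2}} - \dfrac{(\sum X)^2}{N}$

$SS_B = \dfrac{(\sum \text{scores } B_1)^2}{n_{B1}} + \dfrac{(\sum \text{scores } B_2)^2}{n_{B2}} - \dfrac{(\sum X)^2}{N}$

$SS_{AB} = SSB - (SS_A + SS_B)$

$\omega^2_A = \dfrac{SS_A - (df_A)(MSW)}{SST + MSW}$

$\omega^2_B = \dfrac{SS_B - (df_B)(MSW)}{SST + MSW}$

$\omega^2_{AB} = \dfrac{SS_{AB} - (df_{AB})(MSW)}{SST + MSW}$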

Chapter 14 


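Chi-square

$\chi^2 = \sum \dfrac{(\text{observed} - \text{expected})^2}{\text{expected}}$

Chi-square expected cell

$E_{ij} = \dfrac{n_i \times n_j}{N}$

Yates' $\chi^2$

$\chi^2 = \dfrac{N(|ad - bc| - N \div 2)^2}{(a + b)(c + d)(a + c)(b + d)}$

Phi

$\Phi = \sqrt{\dfrac{\chi^2}{N}}$

Cramer's V

$\text{Cramer's } V = \sqrt{\dfrac{\Phi^2}{\min(r - 1,\ c - 1)}}$

McNemar's test

$z = \dfrac{B - C}{\sqrt{B + C}}$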
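
For a 2 × 2 table the Chi-square, Yates' correction, and Phi can all be computed from the four cell frequencies. The sketch below assumes the shortcut form χ² = N(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)], which gives the same value as summing (observed - expected)² / expected over the four cells; Yates' version subtracts N ÷ 2 from |ad - bc| before squaring, and Phi = sqrt(χ² / N). The function names and the cell counts are invented for illustration.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = diff - n / 2          # Yates' correction: |ad - bc| - N / 2
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def phi(chi2, n):
    """Phi = square root of chi-square divided by N."""
    return math.sqrt(chi2 / n)

# Hypothetical cell frequencies
chi2 = chi_square_2x2(10, 20, 30, 40)
chi2_yates = chi_square_2x2(10, 20, 30, 40, yates=True)
phi_value = phi(chi2, 100)
print(round(chi2, 3), round(chi2_yates, 3), round(phi_value, 3))
```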

Chapter 15 


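Pearson correlation

$r_{xy} = \dfrac{\sum (z_x z_y)}{N - 1}$

$r_{xy} = \dfrac{N(\sum XY) - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$

Correction for attenuation

$r_{CA} = \dfrac{r_{xy}}{\sqrt{r_{ttx} \times r_{tty}}}$

Point biserial correlation

$r_{pbi} = \dfrac{\bar{X}_p - \bar{X}_q}{s_x}\sqrt{pq}$

Spearman (rho) correlation

$\rho = 1 - \dfrac{6(\sum d^2)}{N(N^2 - 1)}$

Kendall's concordance

$W = \dfrac{12\sum R^2 - 3m^2 n(n + 1)^2}{m^2 n(n^2 - 1)}$

$\chi^2 = m(n - 1)W$

Correction for ties

$W = \dfrac{12\sum R^2 - 3m^2 n(n + 1)^2}{m^2 n(n^2 - 1) - m\sum(t^3 - t)}$

Phi correlation

$\Phi = \dfrac{BC - AD}{\sqrt{(A + B)(C + D)(A + C)(B + D)}}$

$\chi^2 = N\Phi^2$

$r_{phi} = \dfrac{p_{ik} - p_i p_k}{\sqrt{p_i q_i p_k q_k}}$


Chapter 16

Regression formulas

$b = \dfrac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2}$

$b = r_{xy}\dfrac{s_y}{s_x}$

$\hat{Y}$ (predicted $Y$) $= \bar{Y} + b(X - \bar{X})$

$\hat{Y} = a + bX$

Error variance $= \dfrac{\sum (Y - \hat{Y})^2}{N - 2}$

$SEE = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{N - 2}}$

$SEE = s_y\sqrt{1 - r^2_{xy}}$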
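
The raw-score Pearson formula and the regression slope share the same numerator, which a short program makes plain. The sketch below assumes r = [NΣXY - (ΣX)(ΣY)] / sqrt{[NΣX² - (ΣX)²][NΣY² - (ΣY)²]}, b = [NΣXY - (ΣX)(ΣY)] / [NΣX² - (ΣX)²], a = mean of Y minus b times mean of X, and SEE = sqrt[Σ(Y - predicted Y)² / (N - 2)]. The function names and the five pairs of scores are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from raw scores."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def regression_line(xs, ys):
    """Return (a, b) for predicted Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n      # mean of Y minus b times mean of X
    return a, b

def standard_error_of_estimate(xs, ys):
    """SEE = sqrt(sum of squared residuals / (N - 2))."""
    a, b = regression_line(xs, ys)
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(residual_ss / (len(xs) - 2))

# Hypothetical paired scores
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
r = pearson_r(X, Y)
a, b = regression_line(X, Y)
see = standard_error_of_estimate(X, Y)
print(round(r, 3), round(a, 2), round(b, 2), round(see, 3))
```

For these invented scores the slope is .6 and the intercept 2.2, so a score of X = 3 predicts Y = 4.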


Chapter 18 
Reliability formulas 

Interrater reliability for more than two raters:

$r_{tt} = \dfrac{n\,r_{AB}}{1 + (n - 1)r_{AB}}$


Internal consistency:

$r_{tt} = \dfrac{2r_{AB}}{1 + r_{AB}}$

$\text{K-R20} = \dfrac{n}{n - 1}\left[\dfrac{s_t^2 - \sum s_i^2}{s_t^2}\right]$


$\text{K-R21} = \dfrac{n}{n - 1}\left[1 - \dfrac{\bar{X} - \bar{X}^2 \div n}{s_t^2}\right]$
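
The interrater and split-half formulas are the same adjustment applied to different numbers of parts. The sketch below assumes reliability = n r AB / [1 + (n - 1) r AB], where n is the number of raters and r AB is taken to be the average correlation among them; with n = 2 it reduces to 2 r AB / (1 + r AB). The function names and the correlation of .60 are invented for illustration.

```python
def interrater_reliability(n_raters, r_ab):
    """Reliability for n raters, given the (assumed average) correlation r_ab."""
    return n_raters * r_ab / (1 + (n_raters - 1) * r_ab)

def split_half_reliability(r_ab):
    """Full-test reliability from the correlation between two half-tests."""
    return 2 * r_ab / (1 + r_ab)

three_raters = interrater_reliability(3, 0.60)   # hypothetical correlation
split_half = split_half_reliability(0.60)
print(round(three_raters, 3), round(split_half, 3))
```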
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Appendix E 

Selected Journals in Applied 
Linguistics 


The following list of journals is not meant to be exhaustive. Rather, it gives an 
indication of some of the journals you might wish to consult for research ideas 
or for articles that address research questions that interest you. 


Journal                                                 Publisher

Applied Linguistics                                     Oxford Univ. Press
Applied Psycholinguistics                               Cambridge Univ. Press
Brain and Language                                      Academic Press
CALICO Journal                                          Brigham Young Univ.
Cognition                                               Elsevier Publishers
Cognition & Instruction                                 Lawrence Erlbaum
College English                                         NCTE
College Composition & Communication                     NCTE
Discourse Processes                                     Ablex Publishers
English for Special Purposes Journal                    Pergamon Press
English Language Teaching Journal                       Oxford Univ. Press
English Today                                           Cambridge Univ. Press
English World-wide                                      John Benjamins
Foreign Language Annals                                 ACTFL
IAL (Issues in Applied Linguistics)                     UCLA
IRAL                                                    J. Groos Publishers
Journal of Communication                                Oxford Univ. Press
Journal of Cross-Cultural Psychology                    Sage Press
Journal of Language & Social Psychology                 Multilingual Matters
Journal of Multilingual and Multicultural Development   Multilingual Matters
Journal of Pragmatics                                   North-Holland
Journal of Psycholinguistic Research                    Plenum Press
Language                                                Linguistic Society of America
Language and Cognitive Processes                        Lawrence Erlbaum
Language & Communication                                Pergamon Press
Language & Speech                                       Kingston Press
Language in Society                                     Cambridge Univ. Press
Language Learning                                       Concordia University
Language Problems & Language Planning                   U. Texas Press
Language Testing                                        Edward Arnold
Linguistics                                             Mouton de Gruyter
Linguistic Analysis                                     American Elsevier
Linguistics and Education                               Ablex Publishers
Mind and Language                                       Basil Blackwell
Modern Language Journal                                 NFMLTA
NABE Journal                                            NABE
Reading in a Foreign Language                           Internat'l Educ. Centre
Reading Research Quarterly                              Internat'l Reading Assoc.
RELC Journal                                            SEAMEO
Research in the Teaching of English                     NCTE
Second Language Research                                Edward Arnold
Sign Language Studies                                   Linstok Press
Studies in Language                                     John Benjamins
Studies in Second Language Acquisition                  Cambridge Univ. Press
SYSTEM                                                  Pergamon Press
Teaching English in the Two Year College                NCTE
Teaching English to Deaf and Second Language Students   Gallaudet University
TESL Canada Journal                                     TESL Canada
TESL Reporter                                           BYU Hawaii
TESOL Quarterly                                         TESOL
Word                                                    Internat'l Linguistic Assoc.
World Englishes                                         Pergamon Press
Written Communication                                   Sage Press
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